This notebook explores how diet quality relates to cardiometabolic risk and care in recent NHANES data. Diet is often treated as a single “healthy vs. unhealthy” lever for preventing diabetes and hypertension, but in practice it is tangled up with age, income, behavior bundles (smoking, heavy drinking), and patterns of contact with the health system.
Our goal here is to build an integrated NHANES analytic dataset (demographics, exams, disease/health-care questionnaires, and 24-hour dietary recalls) and use visualization to examine when and for whom a “better” diet shows up as better outcomes. Concretely, we ask: 1. How does diet quality relate to hypertension and diabetes across age groups and levels of adverse health behaviors (Behavior Load Index)? 2. Among adults with diagnosed hypertension or diabetes, does higher diet quality line up with better control, and does this differ by income? 3. Beyond a single diet score, where do the highest-risk individuals sit in the joint space of diet score, total energy, and sodium intake?
The following code chunks construct the analysis dataset, define our composite indices, and generate the figures used to answer these questions.
In this notebook I work with NHANES 2021–2023 data to explore how diet quality and a multi‐factor behavior load index (BLI) relate to access to care, diagnosis, and control of hypertension and diabetes. I use tidyverse tools for data engineering, then build a series of heatmaps, ridge plots, pathway bar charts, and ternary scatterplots to summarize these patterns.
The chunk below loads all required packages and sets a consistent minimalist theme for all figures.
NHANES is distributed across many component files (demographics, diet recalls, questionnaires, exam and lab data). Here I:
SEQN (person
ID).df
with exactly one row per SEQN.# Named vector of NHANES XPT component files
paths <- c(
DEMO = "Data 237 Final_Datasets/DEMO_L.xpt", # demographics (weights, age, sex, race)
DR1TOT = "Data 237 Final_Datasets/dietary/DR1TOT_L.xpt", # day 1 total nutrient intake
DR2TOT = "Data 237 Final_Datasets/dietary/DR2TOT_L.xpt", # day 2 total nutrient intake
DR1IFF = "Data 237 Final_Datasets/dietary/DR1IFF_L.xpt", # day 1 individual foods file
DR2IFF = "Data 237 Final_Datasets/dietary/DR2IFF_L.xpt", # day 2 individual foods file
DPQ = "Data 237 Final_Datasets/health outcome/DPQ_L.xpt", # depression questionnaire
HSQ = "Data 237 Final_Datasets/Other/HSQ_L.xpt", # health status
MCQ = "Data 237 Final_Datasets/health outcome/MCQ_L.xpt", # medical conditions
BPQ = "Data 237 Final_Datasets/health outcome/BPQ_L.xpt", # blood pressure questionnaire
DIQ = "Data 237 Final_Datasets/health outcome/DIQ_L.xpt", # diabetes questionnaire
HUQ = "Data 237 Final_Datasets/Other/HUQ_L.xpt", # health care utilization
INQ = "Data 237 Final_Datasets/Other/INQ_L.xpt", # income and program participation
SMQ = "Data 237 Final_Datasets/Other/SMQ_L.xpt", # smoking
ALQ = "Data 237 Final_Datasets/Other/ALQ_L.xpt", # alcohol use
PAQ = "Data 237 Final_Datasets/Other/PAQ_L.xpt", # physical activity
SLQ = "Data 237 Final_Datasets/Other/SLQ_L.xpt", # sleep
DBQ = "Data 237 Final_Datasets/Other/DBQ_L.xpt", # diet behavior and nutrition
RXQ = "Data 237 Final_Datasets/Other/RXQ_RX_L.xpt", # prescription medications
# Optional measured anchors (Exam/Lab) for later analyses:
BMX = "Data 237 Final_Datasets/Other/BMX_L.xpt", # body measures (BMI, etc.)
BPXO = "Data 237 Final_Datasets/Other/BPXO_L.xpt", # oscillometric blood pressure
GHB = "Data 237 Final_Datasets/Other/GHB_L.xpt", # glycohemoglobin
TCHOL = "Data 237 Final_Datasets/Other/TCHOL_L.xpt" # total cholesterol
)
# Read each component that (a) exists on disk and (b) has SEQN (person identifier)
raw <- list()
for (nm in names(paths)) {
p <- paths[[nm]]
if (!file.exists(p)) next # silently skip missing files
dat <- read_xpt(p) # read XPT with haven
if (!"SEQN" %in% names(dat)) next # keep only person-indexed files
raw[[nm]] <- dat
}
Next, I generate a quick duplicate report that summarizes how many rows per SEQN each component has. This helps distinguish long-form tables (e.g., diet items, multiple BP readings) from already person-level tables.
# For each component table, count rows per SEQN to see if it is long vs. wide
dup_report <- map_df(
names(raw),
~{
x <- raw[[.x]] %>% count(SEQN, name = "n_rows")
tibble(
file = .x,
n_persons = n_distinct(x$SEQN),
max_rows_per_seqn = max(x$n_rows, na.rm = TRUE),
pct_multi = mean(x$n_rows > 1) * 100 # % of persons with >1 row
)
}
)
# Print the report ordered by maximum rows per SEQN
print(dup_report[order(-dup_report$max_rows_per_seqn), ], n = Inf)
## # A tibble: 22 × 4
## file n_persons max_rows_per_seqn pct_multi
## <chr> <int> <int> <dbl>
## 1 DR1IFF 6751 49 99.9
## 2 DR2IFF 5879 40 100.0
## 3 DEMO 11933 1 0
## 4 DR1TOT 8860 1 0
## 5 DR2TOT 8860 1 0
## 6 DPQ 6337 1 0
## 7 HSQ 6615 1 0
## 8 MCQ 11744 1 0
## 9 BPQ 8501 1 0
## 10 DIQ 11744 1 0
## 11 HUQ 11933 1 0
## 12 INQ 11933 1 0
## 13 SMQ 9015 1 0
## 14 ALQ 6337 1 0
## 15 PAQ 8153 1 0
## 16 SLQ 8501 1 0
## 17 DBQ 11933 1 0
## 18 RXQ 11933 1 0
## 19 BMX 8860 1 0
## 20 BPXO 7801 1 0
## 21 GHB 7199 1 0
## 22 TCHOL 8068 1 0
Some components (e.g., individual foods, prescription medications, oscillometric blood pressure) are long-format with multiple rows per person. In the next chunk, I collapse these to per-person summaries and store them in an agg list that will later be merged with the person-level tables.
# Helper: average any present columns among a set of candidate column names
avg_cols <- function(df, cols) {
present <- intersect(names(df), cols)
if (length(present) == 0) return(NULL)
rowMeans(df[present], na.rm = TRUE)
}
agg <- list()
# A) Food-level intakes (DR1IFF / DR2IFF): derive simple per-person features
# Here I count how many food items were reported on day 1 and day 2.
if ("DR1IFF" %in% names(raw)) {
agg$DR1IFF <- raw$DR1IFF %>%
group_by(SEQN) %>%
summarise(
n_foods_day1 = n(), # number of reported food items day 1
.groups = "drop"
)
}
if ("DR2IFF" %in% names(raw)) {
agg$DR2IFF <- raw$DR2IFF %>%
group_by(SEQN) %>%
summarise(
n_foods_day2 = n(), # number of reported food items day 2
.groups = "drop"
)
}
# B) Medication list (RXQ_RX): count prescriptions per person
# Prefer distinct drug IDs if available; otherwise just count rows.
if ("RXQ" %in% names(raw)) {
med_id_col <- intersect(names(raw$RXQ), c("RXDDRGID","RXDRSC1","RXDRSC2"))
agg$RXQ <- raw$RXQ %>%
group_by(SEQN) %>%
summarise(
n_rx = if (length(med_id_col)) n_distinct(.data[[med_id_col[1]]], na.rm = TRUE) else n(),
any_rx = as.integer(n() > 0), # indicator for taking any medication
.groups = "drop"
)
}
# C) Oscillometric blood pressure (BPXO): average multiple readings per person
if ("BPXO" %in% names(raw)) {
bpx <- raw$BPXO %>%
mutate(
# average within visit across available SBP/DBP measurements
sbp_mean_row = avg_cols(., c("BPXOSY1","BPXOSY2","BPXOSY3")),
dbp_mean_row = avg_cols(., c("BPXODI1","BPXODI2","BPXODI3"))
) %>%
group_by(SEQN) %>%
summarise(
SBP_MEAN = mean(sbp_mean_row, na.rm = TRUE),
DBP_MEAN = mean(dbp_mean_row, na.rm = TRUE),
.groups = "drop"
)
agg$BPXO <- bpx
}
Finally, I identify the person-level tables (those already at one row per SEQN), append the aggregated long-form summaries, and merge everything into a single wide dataset df. I start the join from DEMO (if available) so that survey weights and design variables are preserved up front.
# Person-level tables: anything in `raw` that we did not aggregate above
person_level_names <- setdiff(names(raw), names(agg))
person_level <- raw[person_level_names]
# Build a single person-level table by joining all person-level + aggregated long tables
to_join <- c(person_level, agg)
# Start the join from DEMO if present, so weights/strata come first
start_name <- if ("DEMO" %in% names(to_join)) "DEMO" else names(to_join)[1]
df <- to_join[[start_name]]
# Sequentially left-join each remaining component on SEQN
for (nm in setdiff(names(to_join), start_name)) {
df <- full_join(df, to_join[[nm]], by = "SEQN")
}
# Sanity check: we should now have exactly one row per SEQN
stopifnot(!any(duplicated(df$SEQN)))
cat(
"One row per SEQN:", nrow(df), "respondents;",
ncol(df), "columns.\n"
)
## One row per SEQN: 11933 respondents; 432 columns.
Before engineering higher-level constructs, I first assess how complete each variable is. The next chunk computes, for every column, the number and percentage of non-missing values. This gives a quick sense of which variables are well-populated versus mostly missing.
nonmiss_tbl <- tibble::tibble(
column = names(df),
class = vapply(df, function(x) paste(class(x), collapse = ","), character(1)),
non_na = vapply(df, function(x) sum(!is.na(x)), integer(1)),
total = nrow(df)
) %>%
mutate(pct_non_na = round(100 * non_na / total, 1)) %>%
arrange(desc(pct_non_na), column)
print(nonmiss_tbl, n = Inf)
## # A tibble: 432 × 5
## column class non_na total pct_non_na
## <chr> <chr> <int> <int> <dbl>
## 1 DMDHHSIZ numeric 11933 11933 100
## 2 HUQ010 numeric 11933 11933 100
## 3 HUQ030 numeric 11933 11933 100
## 4 HUQ055 numeric 11933 11933 100
## 5 RIAGENDR numeric 11933 11933 100
## 6 RIDAGEYR numeric 11933 11933 100
## 7 RIDRETH1 numeric 11933 11933 100
## 8 RIDRETH3 numeric 11933 11933 100
## 9 RIDSTATR numeric 11933 11933 100
## 10 SDDSRVYR numeric 11933 11933 100
## 11 SDMVPSU numeric 11933 11933 100
## 12 SDMVSTRA numeric 11933 11933 100
## 13 SEQN numeric 11933 11933 100
## 14 WTINT2YR numeric 11933 11933 100
## 15 WTMEC2YR numeric 11933 11933 100
## 16 any_rx integer 11933 11933 100
## 17 n_rx integer 11933 11933 100
## 18 DMDBORN4 numeric 11914 11933 99.8
## 19 AGQ030 numeric 11743 11933 98.4
## 20 DIQ010 numeric 11740 11933 98.4
## 21 MCQ010 numeric 11743 11933 98.4
## 22 MCQ053 numeric 11741 11933 98.4
## 23 HUQ090 numeric 11169 11933 93.6
## 24 HUQ042 numeric 10702 11933 89.7
## 25 INDFMMPC numeric 10464 11933 87.7
## 26 INQ300 numeric 10468 11933 87.7
## 27 INDFMPIR numeric 9892 11933 82.9
## 28 SMAQUEX2 numeric 9015 11933 75.5
## 29 INDFMMPI numeric 8989 11933 75.3
## 30 BMDSTATS numeric 8860 11933 74.2
## 31 DR1DRSTZ numeric 8860 11933 74.2
## 32 DR2DRSTZ numeric 8860 11933 74.2
## 33 RIDEXMON numeric 8860 11933 74.2
## 34 WTDRD1.x numeric 8860 11933 74.2
## 35 WTDRD1.y numeric 8860 11933 74.2
## 36 BMXWT numeric 8754 11933 73.4
## 37 BMXARMC numeric 8562 11933 71.8
## 38 BMXARML numeric 8568 11933 71.8
## 39 BMXHT numeric 8499 11933 71.2
## 40 BPQ020 numeric 8498 11933 71.2
## 41 BPQ080 numeric 8498 11933 71.2
## 42 BPQ101D numeric 8498 11933 71.2
## 43 SLQ300 character 8501 11933 71.2
## 44 SLQ310 character 8501 11933 71.2
## 45 SLQ320 character 8501 11933 71.2
## 46 SLQ330 character 8501 11933 71.2
## 47 DBQ930 numeric 8486 11933 71.1
## 48 DBQ935 numeric 8486 11933 71.1
## 49 DBQ940 numeric 8486 11933 71.1
## 50 DBQ945 numeric 8486 11933 71.1
## 51 BMXBMI numeric 8471 11933 71
## 52 SLD012 numeric 8388 11933 70.3
## 53 SLD013 numeric 8387 11933 70.3
## 54 DIQ180 numeric 8304 11933 69.6
## 55 DMQMILIZ numeric 8301 11933 69.6
## 56 BMXWAIST numeric 8190 11933 68.6
## 57 PAD790U character 8153 11933 68.3
## 58 PAD810U character 8153 11933 68.3
## 59 PAD680 numeric 8138 11933 68.2
## 60 PAD790Q numeric 8135 11933 68.2
## 61 PAD810Q numeric 8139 11933 68.2
## 62 SMQ020 numeric 8135 11933 68.2
## 63 WTPH2YR.y numeric 8068 11933 67.6
## 64 DIQ160 numeric 8022 11933 67.2
## 65 MCQ160A numeric 7807 11933 65.4
## 66 MCQ160B numeric 7808 11933 65.4
## 67 MCQ160C numeric 7807 11933 65.4
## 68 MCQ160D numeric 7808 11933 65.4
## 69 MCQ160E numeric 7807 11933 65.4
## 70 MCQ160F numeric 7806 11933 65.4
## 71 MCQ160L numeric 7807 11933 65.4
## 72 MCQ160M numeric 7806 11933 65.4
## 73 MCQ160P numeric 7807 11933 65.4
## 74 MCQ220 numeric 7807 11933 65.4
## 75 MCQ550 numeric 7807 11933 65.4
## 76 MCQ560 numeric 7807 11933 65.4
## 77 DMDEDUC2 numeric 7794 11933 65.3
## 78 DMDMARTZ numeric 7792 11933 65.3
## 79 DBP_MEAN numeric 7518 11933 63
## 80 SBP_MEAN numeric 7518 11933 63
## 81 BMXLEG numeric 7335 11933 61.5
## 82 WTPH2YR.x numeric 7199 11933 60.3
## 83 LBDTCSI numeric 6890 11933 57.7
## 84 LBXTC numeric 6890 11933 57.7
## 85 DBQ095Z numeric 6797 11933 57
## 86 DR1DAY numeric 6797 11933 57
## 87 DR1EXMER numeric 6797 11933 57
## 88 DR1LANG numeric 6802 11933 57
## 89 DR1STY numeric 6797 11933 57
## 90 DR1TWSZ numeric 6797 11933 57
## 91 DR1_300 numeric 6797 11933 57
## 92 DRQSDIET numeric 6797 11933 57
## 93 DRQSPREP numeric 6797 11933 57
## 94 BMXHIP numeric 6776 11933 56.8
## 95 DR1BWATZ numeric 6754 11933 56.6
## 96 DR1TNUMF numeric 6754 11933 56.6
## 97 DR1_320Z numeric 6754 11933 56.6
## 98 DR1_330Z numeric 6754 11933 56.6
## 99 DRDINT.x numeric 6754 11933 56.6
## 100 DRDINT.y numeric 6754 11933 56.6
## 101 WTDR2D.x numeric 6754 11933 56.6
## 102 WTDR2D.y numeric 6754 11933 56.6
## 103 n_foods_day1 integer 6751 11933 56.6
## 104 DR1HELP numeric 6747 11933 56.5
## 105 DR1MRESP numeric 6747 11933 56.5
## 106 DRABF.x numeric 6734 11933 56.4
## 107 DRABF.y numeric 6734 11933 56.4
## 108 LBXGH numeric 6715 11933 56.3
## 109 DR1TACAR numeric 6694 11933 56.1
## 110 DR1TALCO numeric 6694 11933 56.1
## 111 DR1TATOA numeric 6694 11933 56.1
## 112 DR1TATOC numeric 6694 11933 56.1
## 113 DR1TB12A numeric 6694 11933 56.1
## 114 DR1TBCAR numeric 6694 11933 56.1
## 115 DR1TCAFF numeric 6694 11933 56.1
## 116 DR1TCALC numeric 6694 11933 56.1
## 117 DR1TCARB numeric 6694 11933 56.1
## 118 DR1TCHL numeric 6694 11933 56.1
## 119 DR1TCHOL numeric 6694 11933 56.1
## 120 DR1TCOPP numeric 6694 11933 56.1
## 121 DR1TCRYP numeric 6694 11933 56.1
## 122 DR1TFA numeric 6694 11933 56.1
## 123 DR1TFDFE numeric 6694 11933 56.1
## 124 DR1TFF numeric 6694 11933 56.1
## 125 DR1TFIBE numeric 6694 11933 56.1
## 126 DR1TFOLA numeric 6694 11933 56.1
## 127 DR1TIRON numeric 6694 11933 56.1
## 128 DR1TKCAL numeric 6694 11933 56.1
## 129 DR1TLYCO numeric 6694 11933 56.1
## 130 DR1TLZ numeric 6694 11933 56.1
## 131 DR1TM161 numeric 6694 11933 56.1
## 132 DR1TM181 numeric 6694 11933 56.1
## 133 DR1TM201 numeric 6694 11933 56.1
## 134 DR1TM221 numeric 6694 11933 56.1
## 135 DR1TMAGN numeric 6694 11933 56.1
## 136 DR1TMFAT numeric 6694 11933 56.1
## 137 DR1TMOIS numeric 6694 11933 56.1
## 138 DR1TNIAC numeric 6694 11933 56.1
## 139 DR1TP182 numeric 6694 11933 56.1
## 140 DR1TP183 numeric 6694 11933 56.1
## 141 DR1TP184 numeric 6694 11933 56.1
## 142 DR1TP204 numeric 6694 11933 56.1
## 143 DR1TP205 numeric 6694 11933 56.1
## 144 DR1TP225 numeric 6694 11933 56.1
## 145 DR1TP226 numeric 6694 11933 56.1
## 146 DR1TPFAT numeric 6694 11933 56.1
## 147 DR1TPHOS numeric 6694 11933 56.1
## 148 DR1TPOTA numeric 6694 11933 56.1
## 149 DR1TPROT numeric 6694 11933 56.1
## 150 DR1TRET numeric 6694 11933 56.1
## 151 DR1TS040 numeric 6694 11933 56.1
## 152 DR1TS060 numeric 6694 11933 56.1
## 153 DR1TS080 numeric 6694 11933 56.1
## 154 DR1TS100 numeric 6694 11933 56.1
## 155 DR1TS120 numeric 6694 11933 56.1
## 156 DR1TS140 numeric 6694 11933 56.1
## 157 DR1TS160 numeric 6694 11933 56.1
## 158 DR1TS180 numeric 6694 11933 56.1
## 159 DR1TSELE numeric 6694 11933 56.1
## 160 DR1TSFAT numeric 6694 11933 56.1
## 161 DR1TSODI numeric 6694 11933 56.1
## 162 DR1TSUGR numeric 6694 11933 56.1
## 163 DR1TTFAT numeric 6694 11933 56.1
## 164 DR1TTHEO numeric 6694 11933 56.1
## 165 DR1TVARA numeric 6694 11933 56.1
## 166 DR1TVB1 numeric 6694 11933 56.1
## 167 DR1TVB12 numeric 6694 11933 56.1
## 168 DR1TVB2 numeric 6694 11933 56.1
## 169 DR1TVB6 numeric 6694 11933 56.1
## 170 DR1TVC numeric 6694 11933 56.1
## 171 DR1TVD numeric 6694 11933 56.1
## 172 DR1TVK numeric 6694 11933 56.1
## 173 DR1TZINC numeric 6694 11933 56.1
## 174 DRD340 numeric 6694 11933 56.1
## 175 DRD360 numeric 6695 11933 56.1
## 176 PAD800 numeric 6390 11933 53.5
## 177 DR1DBIH numeric 6375 11933 53.4
## 178 DR2LANG numeric 5929 11933 49.7
## 179 DR2DAY numeric 5902 11933 49.5
## 180 DR2EXMER numeric 5902 11933 49.5
## 181 DR2STY numeric 5902 11933 49.5
## 182 DR2TWSZ numeric 5902 11933 49.5
## 183 DR2_300 numeric 5902 11933 49.5
## 184 DR2BWATZ numeric 5879 11933 49.3
## 185 DR2MRESP numeric 5879 11933 49.3
## 186 DR2TNUMF numeric 5879 11933 49.3
## 187 DR2_320Z numeric 5879 11933 49.3
## 188 DR2_330Z numeric 5879 11933 49.3
## 189 n_foods_day2 integer 5879 11933 49.3
## 190 DR2HELP numeric 5876 11933 49.2
## 191 DR2TACAR numeric 5830 11933 48.9
## 192 DR2TALCO numeric 5830 11933 48.9
## 193 DR2TATOA numeric 5830 11933 48.9
## 194 DR2TATOC numeric 5830 11933 48.9
## 195 DR2TB12A numeric 5830 11933 48.9
## 196 DR2TBCAR numeric 5830 11933 48.9
## 197 DR2TCAFF numeric 5830 11933 48.9
## 198 DR2TCALC numeric 5830 11933 48.9
## 199 DR2TCARB numeric 5830 11933 48.9
## 200 DR2TCHL numeric 5830 11933 48.9
## 201 DR2TCHOL numeric 5830 11933 48.9
## 202 DR2TCOPP numeric 5830 11933 48.9
## 203 DR2TCRYP numeric 5830 11933 48.9
## 204 DR2TFA numeric 5830 11933 48.9
## 205 DR2TFDFE numeric 5830 11933 48.9
## 206 DR2TFF numeric 5830 11933 48.9
## 207 DR2TFIBE numeric 5830 11933 48.9
## 208 DR2TFOLA numeric 5830 11933 48.9
## 209 DR2TIRON numeric 5830 11933 48.9
## 210 DR2TKCAL numeric 5830 11933 48.9
## 211 DR2TLYCO numeric 5830 11933 48.9
## 212 DR2TLZ numeric 5830 11933 48.9
## 213 DR2TM161 numeric 5830 11933 48.9
## 214 DR2TM181 numeric 5830 11933 48.9
## 215 DR2TM201 numeric 5830 11933 48.9
## 216 DR2TM221 numeric 5830 11933 48.9
## 217 DR2TMAGN numeric 5830 11933 48.9
## 218 DR2TMFAT numeric 5830 11933 48.9
## 219 DR2TMOIS numeric 5830 11933 48.9
## 220 DR2TNIAC numeric 5830 11933 48.9
## 221 DR2TP182 numeric 5830 11933 48.9
## 222 DR2TP183 numeric 5830 11933 48.9
## 223 DR2TP184 numeric 5830 11933 48.9
## 224 DR2TP204 numeric 5830 11933 48.9
## 225 DR2TP205 numeric 5830 11933 48.9
## 226 DR2TP225 numeric 5830 11933 48.9
## 227 DR2TP226 numeric 5830 11933 48.9
## 228 DR2TPFAT numeric 5830 11933 48.9
## 229 DR2TPHOS numeric 5830 11933 48.9
## 230 DR2TPOTA numeric 5830 11933 48.9
## 231 DR2TPROT numeric 5830 11933 48.9
## 232 DR2TRET numeric 5830 11933 48.9
## 233 DR2TS040 numeric 5830 11933 48.9
## 234 DR2TS060 numeric 5830 11933 48.9
## 235 DR2TS080 numeric 5830 11933 48.9
## 236 DR2TS100 numeric 5830 11933 48.9
## 237 DR2TS120 numeric 5830 11933 48.9
## 238 DR2TS140 numeric 5830 11933 48.9
## 239 DR2TS160 numeric 5830 11933 48.9
## 240 DR2TS180 numeric 5830 11933 48.9
## 241 DR2TSELE numeric 5830 11933 48.9
## 242 DR2TSFAT numeric 5830 11933 48.9
## 243 DR2TSODI numeric 5830 11933 48.9
## 244 DR2TSUGR numeric 5830 11933 48.9
## 245 DR2TTFAT numeric 5830 11933 48.9
## 246 DR2TTHEO numeric 5830 11933 48.9
## 247 DR2TVARA numeric 5830 11933 48.9
## 248 DR2TVB1 numeric 5830 11933 48.9
## 249 DR2TVB12 numeric 5830 11933 48.9
## 250 DR2TVB2 numeric 5830 11933 48.9
## 251 DR2TVB6 numeric 5830 11933 48.9
## 252 DR2TVC numeric 5830 11933 48.9
## 253 DR2TVD numeric 5830 11933 48.9
## 254 DR2TVK numeric 5830 11933 48.9
## 255 DR2TZINC numeric 5830 11933 48.9
## 256 HSQ590 numeric 5751 11933 48.2
## 257 OSQ230 numeric 5723 11933 48
## 258 IND310 numeric 5650 11933 47.3
## 259 DR2DBIH numeric 5550 11933 46.5
## 260 DPQ010 numeric 5519 11933 46.2
## 261 DPQ020 numeric 5518 11933 46.2
## 262 DPQ030 numeric 5516 11933 46.2
## 263 DPQ040 numeric 5514 11933 46.2
## 264 DPQ050 numeric 5513 11933 46.2
## 265 DPQ060 numeric 5510 11933 46.2
## 266 DPQ070 numeric 5508 11933 46.2
## 267 DPQ080 numeric 5508 11933 46.2
## 268 DPQ090 numeric 5506 11933 46.1
## 269 ALQ111 numeric 5481 11933 45.9
## 270 ALQ121 numeric 4922 11933 41.2
## 271 ALQ151 numeric 4901 11933 41.1
## 272 DBD100 numeric 4742 11933 39.7
## 273 DRD370A numeric 4175 11933 35
## 274 DRD370B numeric 4175 11933 35
## 275 DRD370C numeric 4175 11933 35
## 276 DRD370D numeric 4175 11933 35
## 277 DRD370E numeric 4175 11933 35
## 278 DRD370F numeric 4175 11933 35
## 279 DRD370G numeric 4175 11933 35
## 280 DRD370H numeric 4175 11933 35
## 281 DRD370I numeric 4175 11933 35
## 282 DRD370J numeric 4175 11933 35
## 283 DRD370K numeric 4175 11933 35
## 284 DRD370L numeric 4175 11933 35
## 285 DRD370M numeric 4175 11933 35
## 286 DRD370N numeric 4175 11933 35
## 287 DRD370O numeric 4175 11933 35
## 288 DRD370P numeric 4175 11933 35
## 289 DRD370Q numeric 4175 11933 35
## 290 DRD370R numeric 4175 11933 35
## 291 DRD370S numeric 4175 11933 35
## 292 DRD370T numeric 4175 11933 35
## 293 DRD370U numeric 4175 11933 35
## 294 DRD370V numeric 4173 11933 35
## 295 DPQ100 numeric 4167 11933 34.9
## 296 DMDHRAGZ numeric 4124 11933 34.6
## 297 DMDHRGND numeric 4115 11933 34.5
## 298 ALQ142 numeric 4082 11933 34.2
## 299 ALQ130 numeric 4069 11933 34.1
## 300 DMDHRMAZ numeric 4020 11933 33.7
## 301 DMDHREDZ numeric 3746 11933 31.4
## 302 PAD820 numeric 3687 11933 30.9
## 303 DBQ301 numeric 3500 11933 29.3
## 304 DBQ330 numeric 3500 11933 29.3
## 305 DBQ360 numeric 3357 11933 28.1
## 306 SMQ040 numeric 3243 11933 27.2
## 307 DRD350A numeric 3033 11933 25.4
## 308 DRD350B numeric 3033 11933 25.4
## 309 DRD350C numeric 3033 11933 25.4
## 310 DRD350D numeric 3033 11933 25.4
## 311 DRD350E numeric 3033 11933 25.4
## 312 DRD350F numeric 3033 11933 25.4
## 313 DRD350G numeric 3033 11933 25.4
## 314 DRD350H numeric 3033 11933 25.4
## 315 DRD350I numeric 3032 11933 25.4
## 316 DRD350J numeric 3032 11933 25.4
## 317 DRD350K numeric 3031 11933 25.4
## 318 BPQ030 numeric 2968 11933 24.9
## 319 BPQ150 numeric 2969 11933 24.9
## 320 RIDEXAGM numeric 2787 11933 23.4
## 321 DBQ370 numeric 2762 11933 23.1
## 322 DBQ400 numeric 2762 11933 23.1
## 323 DRD350HQ numeric 2660 11933 22.3
## 324 DBD381 numeric 2643 11933 22.1
## 325 MCQ195 numeric 2532 11933 21.2
## 326 BMDBMIC numeric 2492 11933 20.9
## 327 DBD411 numeric 2378 11933 19.9
## 328 ALQ170 numeric 2358 11933 19.8
## 329 ALQ270 numeric 2366 11933 19.8
## 330 ALQ280 numeric 2362 11933 19.8
## 331 DIQ070 numeric 2281 11933 19.1
## 332 DBQ390 numeric 2159 11933 18.1
## 333 DMDHSEDZ numeric 2127 11933 17.8
## 334 DRD370MQ numeric 2037 11933 17.1
## 335 MCQ035 numeric 1946 11933 16.3
## 336 DMDYRUSR numeric 1875 11933 15.7
## 337 DRD370BQ numeric 1864 11933 15.6
## 338 DBQ424 numeric 1780 11933 14.9
## 339 MCQ500 numeric 1578 11933 13.2
## 340 RIDEXPRG numeric 1503 11933 12.6
## 341 DBD041 numeric 1441 11933 12.1
## 342 DBD055 numeric 1440 11933 12.1
## 343 DBQ010 numeric 1441 11933 12.1
## 344 DBQ421 numeric 1400 11933 11.7
## 345 DBD061 numeric 1356 11933 11.4
## 346 MCQ040 numeric 1219 11933 10.2
## 347 MCQ050 numeric 1219 11933 10.2
## 348 DR1SKY numeric 1197 11933 10
## 349 SMD650 numeric 1185 11933 9.9
## 350 MCQ230A numeric 1169 11933 9.8
## 351 SMD100MN numeric 1175 11933 9.8
## 352 DBD050 numeric 1153 11933 9.7
## 353 DBD030 numeric 1121 11933 9.4
## 354 DID040 numeric 1081 11933 9.1
## 355 DIQ050 numeric 1081 11933 9.1
## 356 DR2SKY numeric 1074 11933 9
## 357 MCQ170M numeric 1053 11933 8.8
## 358 DRD370TQ numeric 982 11933 8.2
## 359 DBQ073A numeric 869 11933 7.3
## 360 SMQ621 numeric 772 11933 6.5
## 361 DRD370EQ numeric 701 11933 5.9
## 362 DRD370AQ numeric 628 11933 5.3
## 363 DRD350BQ numeric 548 11933 4.6
## 364 BMXRECUM numeric 454 11933 3.8
## 365 DRQSDT1 numeric 447 11933 3.7
## 366 MCQ149 numeric 434 11933 3.6
## 367 MCQ170L numeric 425 11933 3.6
## 368 BMILEG numeric 396 11933 3.3
## 369 RIDAGEMN numeric 377 11933 3.2
## 370 BMIHIP numeric 361 11933 3
## 371 DRD350GQ numeric 357 11933 3
## 372 BMIWAIST numeric 347 11933 2.9
## 373 BMIWT numeric 345 11933 2.9
## 374 DID060 numeric 343 11933 2.9
## 375 DRD370DQ numeric 347 11933 2.9
## 376 DIQ060U numeric 332 11933 2.8
## 377 DRD350DQ numeric 313 11933 2.6
## 378 SMD641 numeric 273 11933 2.3
## 379 MCQ510A numeric 260 11933 2.2
## 380 DRD350AQ numeric 256 11933 2.1
## 381 DRD350FQ numeric 230 11933 1.9
## 382 DRD350IQ numeric 223 11933 1.9
## 383 DRD370GQ numeric 223 11933 1.9
## 384 DRD350EQ numeric 217 11933 1.8
## 385 DRD370NQ numeric 213 11933 1.8
## 386 BMIARMC numeric 205 11933 1.7
## 387 BMIARML numeric 200 11933 1.7
## 388 DBQ073B numeric 180 11933 1.5
## 389 DRD370KQ numeric 172 11933 1.4
## 390 DRD370RQ numeric 153 11933 1.3
## 391 MCQ230B numeric 159 11933 1.3
## 392 DRQSDT91 numeric 148 11933 1.2
## 393 BMIHT numeric 134 11933 1.1
## 394 DRD370FQ numeric 134 11933 1.1
## 395 DRD370UQ numeric 124 11933 1
## 396 DRD370SQ numeric 96 11933 0.8
## 397 DRQSDT3 numeric 96 11933 0.8
## 398 DRQSDT7 numeric 93 11933 0.8
## 399 DBQ073U numeric 82 11933 0.7
## 400 DRD370HQ numeric 85 11933 0.7
## 401 DRD370OQ numeric 83 11933 0.7
## 402 MCQ510F numeric 79 11933 0.7
## 403 BMXHEAD numeric 70 11933 0.6
## 404 DBQ073C numeric 66 11933 0.6
## 405 DRD370IQ numeric 76 11933 0.6
## 406 DRQSDT2 numeric 70 11933 0.6
## 407 DRQSDT9 numeric 70 11933 0.6
## 408 DRD350CQ numeric 62 11933 0.5
## 409 DRD370CQ numeric 60 11933 0.5
## 410 MCQ510D numeric 62 11933 0.5
## 411 DRD370QQ numeric 45 11933 0.4
## 412 DRQSDT4 numeric 44 11933 0.4
## 413 MCQ510C numeric 47 11933 0.4
## 414 DRQSDT10 numeric 33 11933 0.3
## 415 DRQSDT11 numeric 32 11933 0.3
## 416 BMIRECUM numeric 18 11933 0.2
## 417 DBQ073E numeric 21 11933 0.2
## 418 DRD370LQ numeric 18 11933 0.2
## 419 DRQSDT8 numeric 27 11933 0.2
## 420 MCQ230C numeric 26 11933 0.2
## 421 SMD630 numeric 23 11933 0.2
## 422 DBQ073D numeric 16 11933 0.1
## 423 DRD350JQ numeric 15 11933 0.1
## 424 DRD370JQ numeric 12 11933 0.1
## 425 DRQSDT12 numeric 13 11933 0.1
## 426 DRQSDT6 numeric 7 11933 0.1
## 427 MCQ510E numeric 16 11933 0.1
## 428 BMIHEAD numeric 0 11933 0
## 429 DRD370PQ numeric 4 11933 0
## 430 DRQSDT5 numeric 2 11933 0
## 431 MCQ230D numeric 2 11933 0
## 432 MCQ510B numeric 4 11933 0
For the analysis, I restrict attention to variables with at least 100 non-missing observations, which balances keeping useful signal and avoiding extremely sparse fields. I also save this cleaned person-level dataset to disk for reproducibility.
keep_by_nonmissing <- function(data, n = 100) {
counts <- colSums(!is.na(data))
keep <- counts >= n
kept_cols <- names(data)[keep]
dropped_cols <- names(data)[!keep]
message(sprintf("Kept %d cols, dropped %d cols (threshold = %d non-NA).",
length(kept_cols), length(dropped_cols), n))
list(
df = data[, kept_cols, drop = FALSE],
counts = counts,
kept = kept_cols,
dropped = dropped_cols
)
}
res <- keep_by_nonmissing(df, n = 100)
## Kept 395 cols, dropped 37 cols (threshold = 100 non-NA).
df <- res$df
out_path <- "Data 237 Final_Datasets/df_clean.csv"
# readr::write_csv is fast and UTF-8 safe
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
readr::write_csv(df, out_path)
NHANES XPT files carry both variable labels (descriptions) and, for
labelled variables, value labels (e.g., 1 = Yes, 2 = No). To keep track
of this metadata, I construct a codebook across all
component files and then a data dictionary for the
final merged df_clean.csv. This lets me trace each analytic
variable back to its original NHANES source and interpret categorical
codes correctly.
# Function: extract labels (variable & value labels) from one XPT
xpt_to_codebook <- function(path, source_name = basename(path)) {
if (!file.exists(path)) return(tibble())
dx <- read_xpt(path)
vars <- names(dx)
# variable labels
var_lab <- map_chr(dx, ~ attr(.x, "label") %||% "")
# class info
classes <- map_chr(dx, ~ paste(class(.x), collapse = "|"))
# value labels (if labelled)
val_labs <- map(dx, function(x) {
labs <- attr(x, "labels", exact = TRUE)
if (is.null(labs)) return(NA_character_)
# turn named integer vector into "code=label; code=label; ..."
paste(paste0(unname(labs), "=", names(labs)), collapse = "; ")
}) |> unlist(use.names = FALSE)
tibble(
source = source_name,
varname = vars,
var_label = var_lab,
class = classes,
value_labels = val_labs
)
}
# Build merged codebook from all XPTs
codebook <- imap_dfr(paths, ~ xpt_to_codebook(.x, source_name = .y))
codebook_path <- "Data 237 Final_Datasets/NHANES_2123_codebook_from_XPT.csv"
#write_csv(codebook, codebook_path) #print if viewer would like to see detailed descriptions for each dataset column, a csv version is included in the submission folder.
# Augment a CSV dictionary for df_clean.csv
df_csv_path <- "Data 237 Final_Datasets/df_clean.csv"
if (file.exists(df_csv_path)) {
df_csv <- read_csv(df_csv_path, show_col_types = FALSE)
dict <- tibble(
varname = names(df_csv),
class_in_csv = vapply(df_csv, function(x) paste(class(x), collapse = "|"), character(1)),
non_na = vapply(df_csv, function(x) sum(!is.na(x)), integer(1)),
n_unique = vapply(df_csv, function(x) dplyr::n_distinct(x, na.rm = TRUE), integer(1))
) |>
left_join(codebook |> select(source, varname, var_label, value_labels),
by = "varname") |>
arrange(varname)
dict_path <- "Data 237 Final_Datasets/df_clean_DATA_DICTIONARY.csv"
write_csv(dict, dict_path)
message("Wrote df dictionary: ", normalizePath(dict_path))
}
## Wrote df dictionary: /Users/leon03/Downloads/Leon Zhou_Data237 Final Project_Final/Data 237 Final_Datasets/df_clean_DATA_DICTIONARY.csv
The main goal is to relate diet quality and behavior load to cardiometabolic outcomes. To do this I:
These steps create a compact analytic dataset df1 with
socio-demographics, behaviors, and clinical markers that I will use for
all downstream plots.
# Helper: given a regex pattern, find the first matching column name
# in the NHANES dictionary that is actually present in df.
find_col <- function(pattern) {
hits <- dict$varname[str_detect(dict$varname, regex(pattern, ignore_case = TRUE))]
hits <- hits[hits %in% names(df)]
if (length(hits)) hits[1] else NA_character_
}
# Map each conceptual construct to its underlying NHANES variable(s).
# The regexes allow some robustness across different naming versions.
vars <- list(
age = find_col("^RIDAGEYR$|RIDAGE"), # age in years
sex = find_col("^RIAGENDR$|RIAGEN"), # sex (1=Male, 2=Female)
race = find_col("^RIDRETH3$|RIDRETH"), # race/ethnicity
pir = find_col("^INDFMPIR$|PIR"), # income-to-poverty ratio
access = find_col("^HUQ090$|HUQ090"), # seen/talked to provider past 12m (1=yes)
btest = find_col("^DIQ180$|DIQ180"), # diabetes blood test past 12m (1=yes)
dx_htn = find_col("^BPQ020$|BPQ020"), # ever told HTN (1=yes)
dx_dm = find_col("^DIQ010$|DIQ010"), # ever told DM (1=yes)
bmi = find_col("^BMXBMI$|BMI"), # body mass index
sbp = find_col("^SBP_MEAN$|BPXOSY|SBP"), # SBP (prefer aggregated SBP_MEAN if present)
dbp = find_col("^DBP_MEAN$|BPXODI|DBP"), # DBP
a1c = find_col("^LBXGH$|^HBA1C$|^GHB"), # glycohemoglobin A1c
sleep = find_col("^SLD010H$|^SLQ0|SLEEP"), # usual hours sleep
smoke = find_col("^SMQ020$|^SMQ0|SMOK"), # current smoking indicator
alc_days= find_col("^ALQ120Q$|^ALQ101$|^ALQ130$|ALC"), # alcohol frequency proxy
pa_mod = find_col("^PAQ6|^PAQ65|^PAQ706|^PAQ(.*)MOD"), # moderate PA minutes or frequency
dpq_sum = find_col("^DPQ(.*)TOT$|^DPQ\\d+$") # PHQ-9 total or individual DPQ items
)
print(vars)
## $age
## [1] "RIDAGEMN"
##
## $sex
## [1] "RIAGENDR"
##
## $race
## [1] "RIDRETH1"
##
## $pir
## [1] "INDFMPIR"
##
## $access
## [1] "HUQ090"
##
## $btest
## [1] "DIQ180"
##
## $dx_htn
## [1] "BPQ020"
##
## $dx_dm
## [1] "DIQ010"
##
## $bmi
## [1] "BMDBMIC"
##
## $sbp
## [1] "SBP_MEAN"
##
## $dbp
## [1] "DBP_MEAN"
##
## $a1c
## [1] "LBXGH"
##
## $sleep
## [1] NA
##
## $smoke
## [1] "SMQ020"
##
## $alc_days
## [1] "ALQ130"
##
## $pa_mod
## [1] NA
##
## $dpq_sum
## [1] "DPQ010"
# Generic accessors to standardize how we pull and transform columns
col_or_na <- function(data, nm) {
# Return the column if it exists, otherwise a vector of NAs
if (!is.na(nm) && length(nm) == 1 && nm %in% names(data)) data[[nm]] else NA
}
yn_vec <- function(x) as.integer(!is.na(x) & x == 1) # convert NHANES 1=yes to 0/1
num_vec <- function(x) suppressWarnings(as.numeric(x)) # numeric coercion with warning suppression
# Build df1: a person-level analytic frame with typed variables
df1 <- df %>%
mutate(
# Demographics and income
age = col_or_na(df, vars$age),
sex_raw = col_or_na(df, vars$sex),
sex = factor(ifelse(is.na(sex_raw), NA, sex_raw),
levels = c(1, 2), labels = c("Male","Female")),
pir = col_or_na(df, vars$pir),
pir_grp = cut(
pir,
breaks = quantile(pir, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE),
include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR")
),
# Access and diagnosis indicators
access = yn_vec(col_or_na(df, vars$access)),
btest = yn_vec(col_or_na(df, vars$btest)),
dx_htn = yn_vec(col_or_na(df, vars$dx_htn)),
dx_dm = yn_vec(col_or_na(df, vars$dx_dm)),
# Clinical measurements
bmi = num_vec(col_or_na(df, vars$bmi)),
sbp = num_vec(col_or_na(df, vars$sbp)),
dbp = num_vec(col_or_na(df, vars$dbp)),
a1c = num_vec(col_or_na(df, vars$a1c)),
# Behavioral measures
sleep_h = num_vec(col_or_na(df, vars$sleep)), # usual hours of sleep
smk_now = yn_vec(col_or_na(df, vars$smoke)), # current smoker (0/1)
alc_days= num_vec(col_or_na(df, vars$alc_days)), # drinking frequency proxy
pa_mod = num_vec(col_or_na(df, vars$pa_mod)) # moderate physical activity
)
# Quick sanity check on basic distributions
summary(select(
df1, age, sex, pir, pir_grp, access, btest, dx_htn, dx_dm,
bmi, sbp, dbp, a1c, sleep_h, smk_now, alc_days, pa_mod
))
## age sex pir pir_grp access
## Min. : 0.00 Male :5575 Min. :0.000 Low PIR :3311 Min. :0.0000
## 1st Qu.: 6.00 Female:6358 1st Qu.:1.180 Mid PIR :3305 1st Qu.:0.0000
## Median :11.00 Median :2.500 High PIR:3276 Median :0.0000
## Mean :11.63 Mean :2.708 NA's :2041 Mean :0.1378
## 3rd Qu.:17.00 3rd Qu.:4.500 3rd Qu.:0.0000
## Max. :24.00 Max. :5.000 Max. :1.0000
## NA's :11556 NA's :2041
## btest dx_htn dx_dm bmi
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:2.000
## Median :0.0000 Median :0.0000 Median :0.00000 Median :2.000
## Mean :0.2824 Mean :0.2488 Mean :0.09059 Mean :2.561
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :4.000
## NA's :9441
## sbp dbp a1c sleep_h
## Min. : 70.0 Min. : 34.00 Min. : 3.20 Min. : NA
## 1st Qu.:106.3 1st Qu.: 64.00 1st Qu.: 5.20 1st Qu.: NA
## Median :116.3 Median : 71.67 Median : 5.50 Median : NA
## Mean :119.1 Mean : 72.21 Mean : 5.71 Mean :NaN
## 3rd Qu.:129.0 3rd Qu.: 79.33 3rd Qu.: 5.80 3rd Qu.: NA
## Max. :232.3 Max. :139.00 Max. :17.10 Max. : NA
## NA's :4415 NA's :4415 NA's :5218 NA's :11933
## smk_now alc_days pa_mod
## Min. :0.0000 Min. : 1.000 Min. : NA
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: NA
## Median :0.0000 Median : 2.000 Median : NA
## Mean :0.2718 Mean : 5.843 Mean :NaN
## 3rd Qu.:1.0000 3rd Qu.: 3.000 3rd Qu.: NA
## Max. :1.0000 Max. :999.000 Max. : NA
## NA's :7864 NA's :11933
# Helper: row-wise mean for a set of numeric columns (e.g., diet day 1 + day 2)
row_mean_numeric <- function(data, cols) {
if (length(cols) == 0) return(rep(NA_real_, nrow(data)))
M <- as.data.frame(lapply(
cols,
function(cn) suppressWarnings(as.numeric(data[[cn]]))
))
rowMeans(M, na.rm = TRUE)
}
# Helper: rescale numeric vector to [0, 1] range, guarding against degenerate ranges
rescale01 <- function(x) {
r <- range(x, na.rm = TRUE)
if (!all(is.finite(r)) || diff(r) == 0) return(rep(NA_real_, length(x)))
(x - r[1]) / diff(r)
}
# Identify diet recall variables (across day 1 and day 2 totals)
diet_kcal_vars <- intersect(
names(df1),
grep("^DR[12].*KCAL$", names(df1), ignore.case = TRUE, value = TRUE)
)
diet_sfat_vars <- intersect(
names(df1),
grep("^DR[12].*(SFAT|TSFAT)$", names(df1), ignore.case = TRUE, value = TRUE)
)
diet_fibe_vars <- intersect(
names(df1),
grep("^DR[12].*(FIBE|DFIB)$", names(df1), ignore.case = TRUE, value = TRUE)
)
# Print which diet variables were found, just for debugging / transparency
print(list(kcal = diet_kcal_vars, sfat = diet_sfat_vars, fibe = diet_fibe_vars))
## $kcal
## [1] "DR1TKCAL" "DR2TKCAL"
##
## $sfat
## [1] "DR1TSFAT" "DR2TSFAT"
##
## $fibe
## [1] "DR1TFIBE" "DR2TFIBE"
# Construct energy-normalized diet variables and composite diet score
df1 <- df1 %>%
mutate(
# Average energy, saturated fat, and fiber across available recall days
diet_kcal = if (length(diet_kcal_vars)) {
rowMeans(across(all_of(diet_kcal_vars)), na.rm = TRUE)
} else NA_real_,
diet_sfat = if (length(diet_sfat_vars)) {
rowMeans(across(all_of(diet_sfat_vars)), na.rm = TRUE)
} else NA_real_,
diet_fibe = if (length(diet_fibe_vars)) {
rowMeans(across(all_of(diet_fibe_vars)), na.rm = TRUE)
} else NA_real_
) %>%
mutate(
# Densities (per kcal) to partially adjust for total energy intake
satfat_dens = diet_sfat / pmax(diet_kcal, 1),
fiber_dens = diet_fibe / pmax(diet_kcal, 1),
# Composite diet score: higher is "better" (low saturated fat, high fiber)
diet_score = rescale01(-satfat_dens) + rescale01(fiber_dens)
)
# -------------------------------------------------------------------
# PHQ-9 total (depression severity) and Behavior Load Index (BLI)
# -------------------------------------------------------------------
# If a precomputed PHQ-9 total is not present, compute it by summing DPQ items.
if (!"dpq_sum" %in% names(df1) || all(is.na(df1$dpq_sum))) {
dpq_items <- dict$varname[
grepl("^DPQ0\\d+$", dict$varname) & dict$varname %in% names(df1)
]
if (length(dpq_items) > 0) {
df1 <- df1 %>%
mutate(dpq_sum = rowSums(dplyr::across(dplyr::all_of(dpq_items)), na.rm = TRUE))
} else {
df1 <- df1 %>% mutate(dpq_sum = NA_real_)
}
}
Below I define the metric of Behavior Load Index (BLI): count six risk factors into a 0–6 index # 1) current smoking # 2) heavy alcohol use # 3) short (<6h) or long (>9h) sleep # 4) low physical activity # 5) depression (PHQ-9 >= 10) # 6) low diet score (bottom third)
q_diet <- suppressWarnings(quantile(df1$diet_score, probs = 1/3, na.rm = TRUE))
diet_cut <- if (is.finite(q_diet)) q_diet else NA_real_
df1 <- df1 %>%
mutate(
sleep_flag = as.integer(!is.na(sleep_h) & (sleep_h < 6 | sleep_h > 9)),
heavy_alc = as.integer(!is.na(alc_days) & alc_days >= 4),
low_pa = as.integer(!is.na(pa_mod) & pa_mod < 150),
dep_flag = as.integer(!is.na(dpq_sum) & dpq_sum >= 10),
low_diet = as.integer(!is.na(diet_score) & !is.na(diet_cut) & diet_score <= diet_cut),
# Raw BLI: sum of all risk flags, ignoring missingness within each person
bli_raw = rowSums(
cbind(smk_now, heavy_alc, sleep_flag, low_pa, dep_flag, low_diet),
na.rm = TRUE
)
) %>%
mutate(
# Convert BLI into quintiles for plotting (Q1 = lowest load, Q5 = highest load)
bli_q = if (all(is.na(bli_raw))) NA_integer_ else dplyr::ntile(bli_raw, 5),
bli_q = factor(
bli_q,
labels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")
)
)
5. Tertiles, age bands, and tile-plot helpers
For the explorative heatmaps below, I want most patterns to be readable “at a glance,” so I compress several continuous variables into tertiles or bands:
diet_terc: tertiles of the composite diet score (Low /
Mid / High).bli_terc: tertiles of the raw Behavior Load Index
(0–6).pir_terc: tertiles of income-to-poverty ratio (PIR),
with a fallback that reuses pir_grp if needed.age_band: four broad adult age groups.After creating these banded variables, I define two generic helpers:
tile_df()
Takes any dataset plus:
and returns a summarized table with the cell count n,
either percent “Yes” for binary outcomes or a mean/median for continuous
outcomes, plus a human-friendly label string for plotting.
tile_plot()
Takes the tile_df() output and produces a standardized
heatmap:
These helpers keep the later figure code short and readable: once the
bands and helpers are defined, each new heatmap is just a call to
tile_df() followed by tile_plot().
df1 <- df1 %>%
mutate(
# Diet quality tertiles (Low / Mid / High)
diet_terc = case_when(
is.na(diet_score) ~ NA_character_,
TRUE ~ paste(c("Low","Mid","High")[ntile(diet_score, 3)])
) |> factor(c("Low","Mid","High"), ordered = TRUE),
# BLI tertiles from raw 0–6 index
bli_terc = if (!all(is.na(bli_raw))) {
paste(c("Low","Mid","High")[ntile(bli_raw, 3)])
} else NA_character_
) %>%
mutate(
bli_terc = factor(bli_terc, levels = c("Low","Mid","High"), ordered = TRUE),
# PIR tertiles (income)
pir_terc = case_when(
!is.na(pir) ~ paste(c("Low","Mid","High")[ntile(pir, 3)]),
!is.na(pir_grp) ~ case_when(
pir_grp %in% c("Low PIR","Low") ~ "Low",
pir_grp %in% c("Mid PIR","Mid") ~ "Mid",
pir_grp %in% c("High PIR","High") ~ "High",
TRUE ~ NA_character_
),
TRUE ~ NA_character_
) |> factor(c("Low","Mid","High"), ordered = TRUE),
# Age bands
age_band = cut(age, c(18,35,50,65,Inf), right = FALSE,
labels = c("18–34","35–49","50–64","65+"))
)
has_col <- function(data, nm) (nm %in% names(data)) && any(!is.na(data[[nm]]))
tile_df <- function(data, x_band, y_band, value, type = c("binary","mean","median")) {
type <- match.arg(type)
x_band <- rlang::ensym(x_band)
y_band <- rlang::ensym(y_band)
value <- rlang::enquo(value)
data %>%
filter(!is.na(!!x_band), !is.na(!!y_band)) %>%
group_by(!!y_band, !!x_band) %>%
summarise(
n = sum(!is.na(!!value)),
pct_yes = if (type == "binary") 100 * mean((!!value) == 1, na.rm = TRUE) else NA_real_,
mean_val = if (type == "mean") mean((!!value), na.rm = TRUE) else NA_real_,
med_val = if (type == "median") median((!!value), na.rm = TRUE) else NA_real_,
.groups = "drop"
) %>%
mutate(
fill_val = dplyr::coalesce(pct_yes, mean_val, med_val),
label = case_when(
!is.na(pct_yes) ~ paste0(round(pct_yes), "%"),
!is.na(mean_val) ~ sprintf("%.1f", mean_val),
!is.na(med_val) ~ sprintf("%.1f", med_val),
TRUE ~ ""
)
) %>%
rename(y = !!y_band, x = !!x_band)
}
tile_plot <- function(tile_dat, title, fill_label = "% Yes", limits = c(0,100)) {
ggplot(tile_dat, aes(x, y, fill = fill_val)) +
geom_tile(color = "white") +
geom_text(aes(label = label), size = 3.5, fontface = "bold", color = "white") +
scale_fill_gradient(
low = "#eff3ff", high = "#084594",
limits = limits, oob = scales::squish, name = fill_label
) +
labs(title = title, x = "Diet quality (tertiles)", y = "Stratifier (tertiles/bands)") +
theme_minimal(base_size = 12) +
theme(
panel.grid = element_blank(),
axis.title.y = element_text(margin = margin(r = 6)),
axis.title.x = element_text(margin = margin(t = 6)),
legend.position = "right",
plot.title = element_text(face = "bold")
)
}
Data Analysis and Visualization
After data preparation, I start from a practical question: when does better diet quality actually translate into better care and cardiometabolic control, and when is it overshadowed by other forces? Diet is a modifiable behavior, but people do not eat in a vacuum—other behaviors (smoking, sleep, activity, alcohol, depression) and structural constraints (income and access) travel with diet and may blunt or amplify its effects. Understanding how diet quality operates within these broader contexts is crucial for deciding whether to prioritize diet counseling alone, multi-behavior interventions, or structural policies.
Here I focus on the interaction between diet quality and two “anchor”
contexts: behavior load (BLI, a 0–6 index of
co-occurring risk behaviors) and income (PIR tertiles).
My working hypotheses are:
1. Within a given behavior or income stratum, higher diet quality should
be associated with better prevention and control (lower
A1c ≥ 7%, less uncontrolled BP, at least as good access).
2. At the same time, large gaps across BLI or PIR at a fixed diet
tertile would suggest that behavior clustering and socioeconomic
position dominate over diet alone.
To probe these hypotheses, Figure A uses 3×3 heatmaps (diet tertiles × BLI or PIR) to summarize, for each subgroup, the prevalence of six care/outcome measures:
Each cell shows the percent of adults in that diet–context combination with the outcome, with darker tiles indicating higher prevalence. The design choice of tertiles balances resolution with stable cell sizes, and the outcome set spans the care pathway (access → prevention → diagnosis → control). Reading across each row asks “within a given BLI or income level, does moving from low to high diet quality materially change risk?”; reading down each column asks “for similar diet quality, how much do behavior load and income still shape who gets diagnosed and controlled?” The code below constructs these 3×3 tables and renders the tiles.
# helpers: pull per-day kcal, saturated fat, and fiber columns
diet_kcal_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*KCAL")]
diet_sfat_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*SFAT")]
diet_fibe_cols <- names(df1)[stringr::str_detect(names(df1), "^DR[12].*(FIBE|DFIB)")]
row_mean_numeric <- function(data, cols){
if (!length(cols)) return(rep(NA_real_, nrow(data)))
rowMeans(
lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]]))) |> as.data.frame(),
na.rm = TRUE
)
}
rescale01 <- function(x){
r <- range(x, na.rm = TRUE)
if (!all(is.finite(r)) || diff(r) == 0) return(rep(NA_real_, length(x)))
(x - r[1]) / diff(r)
}
# recompute diet_kcal / sat fat / fiber and a combined diet_score
df1 <- df1 %>%
mutate(
diet_kcal = row_mean_numeric(cur_data(), diet_kcal_cols),
diet_sfat = row_mean_numeric(cur_data(), diet_sfat_cols),
diet_fibe = row_mean_numeric(cur_data(), diet_fibe_cols),
satfat_dens = diet_sfat / pmax(diet_kcal, 1),
fiber_dens = diet_fibe / pmax(diet_kcal, 1),
diet_score = rescale01(-satfat_dens) + rescale01(fiber_dens)
)
# cut diet_score into three ordered groups: Low / Mid / High
qs_diet <- quantile(df1$diet_score, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$diet_q3 <- cut(
df1$diet_score,
qs_diet,
include.lowest = TRUE,
labels = c("Low diet","Mid diet","High diet")
)
Next I define a 5-level BLI band (bli5) if it does not already exist, and a 3-level PIR band (pir3). These are the y-axis stratifiers for the two panels of Figure A.
# BLI quintiles (if not already present in df1)
if (!("bli_q" %in% names(df1)) || all(is.na(df1$bli_q))) {
smk_now <- if ("smk_now" %in% names(df1)) df1$smk_now else 0L
heavy_alc <- if ("heavy_alc" %in% names(df1)) df1$heavy_alc else 0L
sleep_flag <- if ("sleep_flag" %in% names(df1)) df1$sleep_flag else 0L
low_pa <- if ("low_pa" %in% names(df1)) df1$low_pa else 0L
dep_flag <- if ("dep_flag" %in% names(df1)) df1$dep_flag else 0L
# low diet component: lowest third of diet_score if needed
qd <- suppressWarnings(quantile(df1$diet_score, 1/3, na.rm = TRUE))
low_diet <- if ("low_diet" %in% names(df1)) {
df1$low_diet
} else {
as.integer(!is.na(df1$diet_score) & df1$diet_score <= qd)
}
df1$bli_raw <- rowSums(
cbind(smk_now, heavy_alc, sleep_flag, low_pa, dep_flag, low_diet),
na.rm = TRUE
)
qs_bli <- quantile(df1$bli_raw, seq(0, 1, 0.2), na.rm = TRUE)
df1$bli5 <- cut(
df1$bli_raw,
qs_bli,
include.lowest = TRUE,
labels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")
)
} else {
# if bli_q already exists, just reuse it as bli5
df1$bli5 <- df1$bli_q
}
df1$bli5 <- forcats::fct_inorder(df1$bli5)
# reorder levels so "highest burden" appears at the top of the tile grid
bli_levels <- c("BLI Q5 (highest)","Q4","Q3","Q2","BLI Q1 (lowest)")
df1$bli5 <- factor(df1$bli5, levels = bli_levels)
# PIR tertiles: either reuse pir_grp or cut numeric PIR directly
if ("pir_grp" %in% names(df1) && !all(is.na(df1$pir_grp))) {
df1$pir3 <- factor(df1$pir_grp, levels = c("Low PIR","Mid PIR","High PIR"))
} else {
qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$pir3 <- cut(
df1$pir,
qs_pir,
include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR")
)
}
With diet_q3, bli5, and pir3 defined, I now create a helper make_2way() that:
restricts to complete cases of the x and y banding variables,
defines binary outcome flags for each care/outcome measure,
computes the cell size n and mean prevalence for each outcome,
fills in any missing tile combinations so the grid is always 3×3.
I then build two summary tables: TAB_DIET_BLI (diet × BLI) and TAB_DIET_PIR (diet × PIR).
make_2way <- function(data, xvar, yvar){
data %>%
filter(!is.na(.data[[xvar]]), !is.na(.data[[yvar]])) %>%
mutate(
access = as.integer(access %in% 1),
prev_bt = as.integer(btest %in% 1),
dx_htn = as.integer(dx_htn %in% 1),
dx_dm = as.integer(dx_dm %in% 1),
a1c_hi = as.integer(!is.na(a1c) & a1c >= 7),
bp_uncontrolled = as.integer(
!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)
)
) %>%
group_by(.data[[yvar]], .data[[xvar]]) %>%
summarise(
n = dplyr::n(),
p_access = mean(access, na.rm = TRUE),
p_prev_bt = mean(prev_bt, na.rm = TRUE),
p_htn = mean(dx_htn, na.rm = TRUE),
p_dm = mean(dx_dm, na.rm = TRUE),
p_bp_unc = mean(bp_uncontrolled, na.rm = TRUE),
p_a1c7 = mean(a1c_hi, na.rm = TRUE),
.groups = "drop"
) %>%
# standardize x/y column names for plotting
rename(y = 1, x = 2) %>%
mutate(
x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
y = if (identical(yvar, "bli5")) {
factor(y, levels = bli_levels)
} else {
forcats::fct_inorder(y)
}
) %>%
# explicitly complete the 3×3 grid so tiles are never missing
tidyr::complete(
x = factor(
c("Low diet","Mid diet","High diet"),
levels = c("Low diet","Mid diet","High diet")
),
y = if (identical(yvar, "bli5")) {
factor(bli_levels, levels = bli_levels)
} else {
y
},
fill = list(
n = 0L,
p_access = NA_real_,
p_prev_bt = NA_real_,
p_htn = NA_real_,
p_dm = NA_real_,
p_bp_unc = NA_real_,
p_a1c7 = NA_real_
)
)
}
# Construct the two 3×3 tables used in Figure A
TAB_DIET_BLI <- make_2way(df1, "diet_q3", "bli5")
TAB_DIET_PIR <- make_2way(df1, "diet_q3", "pir3")
Finally, I define a plotting helper mk_tile() that converts one of these tables into a 6-panel layout (one panel per outcome) and a wrapper panel_from_tab() that arranges the panels using patchwork. The panel_diet_bli and panel_diet_pir objects are the final Figure A panels.
library(patchwork)
palette_low <- "#EEF2F7"
palette_high <- "#1F3A8A"
fmt_pct <- scales::percent_format(accuracy = 1)
mk_tile <- function(dat, value, title){
dd <- dat %>%
mutate(
val = .data[[value]],
lab = ifelse(is.na(val), "", fmt_pct(val))
)
# row-level sample size labels (n=…) on the right margin
row_lab <- dd %>%
group_by(y) %>%
summarise(nn = sum(n, na.rm = TRUE), .groups = "drop")
ggplot(dd, aes(x = x, y = y, fill = val)) +
geom_tile(color = "white", linewidth = 0.6) +
geom_text(
aes(label = lab),
size = 3.2, fontface = "bold", color = "white"
) +
geom_text(
data = row_lab,
aes(x = 3.55, y = y, label = paste0("n=", scales::comma(nn))),
inherit.aes = FALSE,
hjust = 0, vjust = 0.5,
size = 3.1, color = "grey20"
) +
scale_fill_gradient(
limits = c(0, 1),
low = palette_low,
high = palette_high,
na.value = "grey85",
name = "% Yes",
labels = scales::percent_format(accuracy = 1)
) +
labs(
x = "Diet quality (tertiles)",
y = NULL,
title = title
) +
coord_cartesian(xlim = c(1, 3.9), clip = "off") +
theme_minimal(base_size = 12) +
theme(
panel.grid = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.y = element_text(margin = margin(r = 6)),
plot.margin = margin(t = 6, r = 36, b = 6, l = 6)
)
}
panel_from_tab <- function(tab, big_title){
p1 <- mk_tile(tab, "p_access", "Access (seen clinician)")
p2 <- mk_tile(tab, "p_prev_bt", "Prevention (blood test 12m)")
p3 <- mk_tile(tab, "p_htn", "Dx: Hypertension (ever told)")
p4 <- mk_tile(tab, "p_dm", "Dx: Diabetes (ever told)")
p5 <- mk_tile(tab, "p_bp_unc", "Uncontrolled BP (SBP≥140 / DBP≥90)")
p6 <- mk_tile(tab, "p_a1c7", "A1c ≥ 7%")
(p1 | p2 | p3) /
(p4 | p5 | p6) +
plot_annotation(
title = big_title,
theme = theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5)
)
)
}
panel_diet_bli <- panel_from_tab(
TAB_DIET_BLI,
"Diet Quality × Behavior Load (BLI) outcome tiles"
)
panel_diet_pir <- panel_from_tab(
TAB_DIET_PIR,
"Diet Quality × Income (PIR) outcome tiles"
)
# Optionally print in the knitted document:
print(panel_diet_bli)
print(panel_diet_pir)
Numeric Summaries: To complement the heatmaps, I tabulate the underlying cell counts and prevalence, then compute simple contrasts between low and high diet quality within each behavior load (or income) stratum. I also run an ordinal trend test treating diet tertile as a 1–3 score.
# 10.1 Numeric summaries for Figure A tiles
# Ensure global binary flags exist
df1 <- df1 %>%
mutate(
bp_uncontrolled = as.integer(!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)),
a1c_hi = as.integer(!is.na(a1c) & a1c >= 7)
)
# Helper to make a long cell table from a TAB_* object
make_figA_cells <- function(tab, strat_label) {
tab %>%
dplyr::select(y, x, n, dplyr::starts_with("p_")) %>%
tidyr::pivot_longer(
cols = dplyr::starts_with("p_"),
names_to = "outcome_short",
values_to = "prop"
) %>%
dplyr::mutate(
outcome = dplyr::recode(
outcome_short,
p_access = "Access (seen clinician, 12m)",
p_prev_bt = "Prevention (blood test, 12m)",
p_htn = "Dx: Hypertension (ever told)",
p_dm = "Dx: Diabetes (ever told)",
p_bp_unc = "Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90)",
p_a1c7 = "A1c ≥ 7%"
),
pct = 100 * prop,
n_yes = round(prop * n),
diet = x,
strata = y,
strat_type = strat_label
) %>%
dplyr::select(strat_type, strata, diet, outcome, n, n_yes, pct) %>%
dplyr::arrange(strat_type, outcome, strata, diet)
}
figA_cells_bli <- make_figA_cells(TAB_DIET_BLI, "BLI quintile")
figA_cells_pir <- make_figA_cells(TAB_DIET_PIR, "PIR tertile")
# Example: inspect a subset of the table in the knitted doc
knitr::kable(
dplyr::slice_head(figA_cells_bli, n = 18),
digits = 1,
caption = "Selected cells from Figure A (Diet × BLI): n and % with each outcome."
)
| strat_type | strata | diet | outcome | n | n_yes | pct |
|---|---|---|---|---|---|---|
| BLI quintile | BLI Q5 (highest) | Low diet | A1c ≥ 7% | 1246 | 120 | 9.6 |
| BLI quintile | BLI Q5 (highest) | Mid diet | A1c ≥ 7% | 368 | 26 | 7.1 |
| BLI quintile | BLI Q5 (highest) | High diet | A1c ≥ 7% | 296 | 20 | 6.8 |
| BLI quintile | Q4 | Low diet | A1c ≥ 7% | 804 | 30 | 3.7 |
| BLI quintile | Q4 | Mid diet | A1c ≥ 7% | 406 | 28 | 6.9 |
| BLI quintile | Q4 | High diet | A1c ≥ 7% | 384 | 42 | 10.9 |
| BLI quintile | Q3 | Low diet | A1c ≥ 7% | 182 | 7 | 3.8 |
| BLI quintile | Q3 | Mid diet | A1c ≥ 7% | 475 | 22 | 4.6 |
| BLI quintile | Q3 | High diet | A1c ≥ 7% | 495 | 19 | 3.8 |
| BLI quintile | Q2 | Low diet | A1c ≥ 7% | 0 | NA | NA |
| BLI quintile | Q2 | Mid diet | A1c ≥ 7% | 501 | 8 | 1.6 |
| BLI quintile | Q2 | High diet | A1c ≥ 7% | 523 | 13 | 2.5 |
| BLI quintile | BLI Q1 (lowest) | Low diet | A1c ≥ 7% | 0 | NA | NA |
| BLI quintile | BLI Q1 (lowest) | Mid diet | A1c ≥ 7% | 482 | 17 | 3.5 |
| BLI quintile | BLI Q1 (lowest) | High diet | A1c ≥ 7% | 534 | 22 | 4.1 |
| BLI quintile | BLI Q5 (highest) | Low diet | Access (seen clinician, 12m) | 1246 | 214 | 17.2 |
| BLI quintile | BLI Q5 (highest) | Mid diet | Access (seen clinician, 12m) | 368 | 92 | 25.0 |
| BLI quintile | BLI Q5 (highest) | High diet | Access (seen clinician, 12m) | 296 | 62 | 20.9 |
knitr::kable(
dplyr::slice_head(figA_cells_pir, n = 18),
digits = 1,
caption = "Selected cells from Figure A (Diet × PIR): n and % with each outcome."
)
| strat_type | strata | diet | outcome | n | n_yes | pct |
|---|---|---|---|---|---|---|
| PIR tertile | Low PIR | Low diet | A1c ≥ 7% | 641 | 53 | 8.3 |
| PIR tertile | Low PIR | Mid diet | A1c ≥ 7% | 604 | 29 | 4.8 |
| PIR tertile | Low PIR | High diet | A1c ≥ 7% | 563 | 47 | 8.3 |
| PIR tertile | Mid PIR | Low diet | A1c ≥ 7% | 726 | 59 | 8.1 |
| PIR tertile | Mid PIR | Mid diet | A1c ≥ 7% | 677 | 31 | 4.6 |
| PIR tertile | Mid PIR | High diet | A1c ≥ 7% | 588 | 30 | 5.1 |
| PIR tertile | High PIR | Low diet | A1c ≥ 7% | 615 | 32 | 5.2 |
| PIR tertile | High PIR | Mid diet | A1c ≥ 7% | 691 | 26 | 3.8 |
| PIR tertile | High PIR | High diet | A1c ≥ 7% | 812 | 22 | 2.7 |
| PIR tertile | Low PIR | Low diet | Access (seen clinician, 12m) | 641 | 109 | 17.0 |
| PIR tertile | Low PIR | Mid diet | Access (seen clinician, 12m) | 604 | 118 | 19.5 |
| PIR tertile | Low PIR | High diet | Access (seen clinician, 12m) | 563 | 88 | 15.6 |
| PIR tertile | Mid PIR | Low diet | Access (seen clinician, 12m) | 726 | 93 | 12.8 |
| PIR tertile | Mid PIR | Mid diet | Access (seen clinician, 12m) | 677 | 91 | 13.4 |
| PIR tertile | Mid PIR | High diet | Access (seen clinician, 12m) | 588 | 72 | 12.2 |
| PIR tertile | High PIR | Low diet | Access (seen clinician, 12m) | 615 | 101 | 16.4 |
| PIR tertile | High PIR | Mid diet | Access (seen clinician, 12m) | 691 | 116 | 16.8 |
| PIR tertile | High PIR | High diet | Access (seen clinician, 12m) | 812 | 119 | 14.7 |
Interpretations: In the BLI heatmap cells, adults in the worst behavior-load group (BLI Q5) show a clear A1c gradient by diet—A1c ≥ 7% falls from ~9.6% in low diet to ~7% in mid/high diet—while access to care within this high-risk group is actually highest in the mid-diet tertile (~25% vs ~17–21%). Across PIR strata, mid-diet groups generally have the lowest A1c ≥ 7% prevalence (e.g., 4.6–4.8% vs ~8% in low diet), and in high-income adults the combination of high PIR and high diet quality yields the lowest A1c risk (~2.7%) with only modest differences in recent clinician contact across diet tertiles (roughly 12–20%).
Within each BLI quintile (or PIR tertile), I then compare outcome prevalence in the high versus low diet tertiles. This puts a numeric effect size under each pair of tiles in the heatmaps.
# Contrast High vs Low diet within each stratum --------------------------
contrast_high_low <- function(cells_tbl) {
cells_tbl %>%
dplyr::filter(diet %in% c("Low diet","High diet")) %>%
dplyr::select(strat_type, strata, outcome, diet, pct) %>%
tidyr::pivot_wider(names_from = diet, values_from = pct) %>%
dplyr::mutate(
diff_high_minus_low = `High diet` - `Low diet`
) %>%
dplyr::arrange(strat_type, outcome, strata)
}
diet_contrast_bli <- contrast_high_low(figA_cells_bli)
diet_contrast_pir <- contrast_high_low(figA_cells_pir)
knitr::kable(
diet_contrast_bli,
digits = 1,
caption = "Figure A (Diet × BLI): difference in prevalence (High – Low diet) within each BLI quintile."
)
| strat_type | strata | outcome | Low diet | High diet | diff_high_minus_low |
|---|---|---|---|---|---|
| BLI quintile | BLI Q5 (highest) | A1c ≥ 7% | 9.6 | 6.8 | -2.9 |
| BLI quintile | Q4 | A1c ≥ 7% | 3.7 | 10.9 | 7.2 |
| BLI quintile | Q3 | A1c ≥ 7% | 3.8 | 3.8 | 0.0 |
| BLI quintile | Q2 | A1c ≥ 7% | NA | 2.5 | NA |
| BLI quintile | BLI Q1 (lowest) | A1c ≥ 7% | NA | 4.1 | NA |
| BLI quintile | BLI Q5 (highest) | Access (seen clinician, 12m) | 17.2 | 20.9 | 3.8 |
| BLI quintile | Q4 | Access (seen clinician, 12m) | 12.4 | 17.4 | 5.0 |
| BLI quintile | Q3 | Access (seen clinician, 12m) | 12.1 | 14.3 | 2.3 |
| BLI quintile | Q2 | Access (seen clinician, 12m) | NA | 11.5 | NA |
| BLI quintile | BLI Q1 (lowest) | Access (seen clinician, 12m) | NA | 9.9 | NA |
| BLI quintile | BLI Q5 (highest) | Dx: Diabetes (ever told) | 16.7 | 17.2 | 0.5 |
| BLI quintile | Q4 | Dx: Diabetes (ever told) | 7.7 | 15.6 | 7.9 |
| BLI quintile | Q3 | Dx: Diabetes (ever told) | 8.8 | 6.3 | -2.5 |
| BLI quintile | Q2 | Dx: Diabetes (ever told) | NA | 5.5 | NA |
| BLI quintile | BLI Q1 (lowest) | Dx: Diabetes (ever told) | NA | 6.6 | NA |
| BLI quintile | BLI Q5 (highest) | Dx: Hypertension (ever told) | 39.5 | 42.2 | 2.7 |
| BLI quintile | Q4 | Dx: Hypertension (ever told) | 19.8 | 40.1 | 20.3 |
| BLI quintile | Q3 | Dx: Hypertension (ever told) | 18.7 | 24.8 | 6.2 |
| BLI quintile | Q2 | Dx: Hypertension (ever told) | NA | 19.7 | NA |
| BLI quintile | BLI Q1 (lowest) | Dx: Hypertension (ever told) | NA | 18.4 | NA |
| BLI quintile | BLI Q5 (highest) | Prevention (blood test, 12m) | 35.3 | 42.2 | 6.9 |
| BLI quintile | Q4 | Prevention (blood test, 12m) | 24.1 | 40.1 | 16.0 |
| BLI quintile | Q3 | Prevention (blood test, 12m) | 22.5 | 33.5 | 11.0 |
| BLI quintile | Q2 | Prevention (blood test, 12m) | NA | 33.3 | NA |
| BLI quintile | BLI Q1 (lowest) | Prevention (blood test, 12m) | NA | 31.5 | NA |
| BLI quintile | BLI Q5 (highest) | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 17.0 | 16.6 | -0.5 |
| BLI quintile | Q4 | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 10.0 | 18.5 | 8.5 |
| BLI quintile | Q3 | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 8.8 | 11.7 | 2.9 |
| BLI quintile | Q2 | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | NA | 11.7 | NA |
| BLI quintile | BLI Q1 (lowest) | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | NA | 11.2 | NA |
knitr::kable(
diet_contrast_pir,
digits = 1,
caption = "Figure A (Diet × PIR): difference in prevalence (High – Low diet) within each PIR tertile."
)
| strat_type | strata | outcome | Low diet | High diet | diff_high_minus_low |
|---|---|---|---|---|---|
| PIR tertile | Low PIR | A1c ≥ 7% | 8.3 | 8.3 | 0.1 |
| PIR tertile | Mid PIR | A1c ≥ 7% | 8.1 | 5.1 | -3.0 |
| PIR tertile | High PIR | A1c ≥ 7% | 5.2 | 2.7 | -2.5 |
| PIR tertile | Low PIR | Access (seen clinician, 12m) | 17.0 | 15.6 | -1.4 |
| PIR tertile | Mid PIR | Access (seen clinician, 12m) | 12.8 | 12.2 | -0.6 |
| PIR tertile | High PIR | Access (seen clinician, 12m) | 16.4 | 14.7 | -1.8 |
| PIR tertile | Low PIR | Dx: Diabetes (ever told) | 14.5 | 13.7 | -0.8 |
| PIR tertile | Mid PIR | Dx: Diabetes (ever told) | 13.2 | 9.4 | -3.9 |
| PIR tertile | High PIR | Dx: Diabetes (ever told) | 11.4 | 5.4 | -6.0 |
| PIR tertile | Low PIR | Dx: Hypertension (ever told) | 29.5 | 29.5 | 0.0 |
| PIR tertile | Mid PIR | Dx: Hypertension (ever told) | 32.0 | 27.6 | -4.4 |
| PIR tertile | High PIR | Dx: Hypertension (ever told) | 30.2 | 24.5 | -5.7 |
| PIR tertile | Low PIR | Prevention (blood test, 12m) | 22.5 | 26.6 | 4.2 |
| PIR tertile | Mid PIR | Prevention (blood test, 12m) | 29.2 | 33.8 | 4.6 |
| PIR tertile | High PIR | Prevention (blood test, 12m) | 40.2 | 44.2 | 4.0 |
| PIR tertile | Low PIR | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 12.9 | 16.5 | 3.6 |
| PIR tertile | Mid PIR | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 15.2 | 13.6 | -1.5 |
| PIR tertile | High PIR | Uncontrolled BP (SBP ≥ 140 or DBP ≥ 90) | 13.2 | 11.2 | -2.0 |
In the BLI heatmap cells, adults in the worst behavior-load group (BLI Q5) show a clear A1c gradient by diet—A1c ≥ 7% falls from ~9.6% in low diet to ~7% in mid/high diet—while access to care within this high-risk group is actually highest in the mid-diet tertile (~25% vs ~17–21%). Across PIR strata, mid-diet groups generally have the lowest A1c ≥ 7% prevalence (e.g., 4.6–4.8% vs ~8% in low diet), and in high-income adults the combination of high PIR and high diet quality yields the lowest A1c risk (~2.7%) with only modest differences in recent clinician contact across diet tertiles (roughly 12–20%).
Figure A (continued): Diet × Age/Sex/Race tiles
To further examine heterogeneity, I replicate the tile structure for diet quality × age, × sex, and × race/ethnicity. The logic is: Compress continuous or coded variables into a small number of interpretable groups (3 age bands, 2 sex levels, 5+ race/ethnicity categories). Reuse the same make_2way() + panel_from_tab() machinery from Figure A to compute cell-wise prevalences. Generate three separate panels: diet × age, diet × sex, and diet × race/ethnicity.
# --- 7.1 Age bands: compress RIDAGEYR into three adult age groups ---
df1 <- df1 %>%
mutate(
age3 = case_when(
!is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65 ~ "Older (65+)"
),
age3 = factor(
age3,
levels = c("Younger (18–34)","Middle (35–64)","Older (65+)")
)
)
# Build the diet × age tile table and enforce a consistent factor order
TAB_DIET_AGE <- make_2way(df1, "diet_q3", "age3") %>%
mutate(
x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
y = forcats::fct_relevel(y, "Younger (18–34)","Middle (35–64)","Older (65+)")
) %>%
# make sure all combinations of age band × diet tercile are present
tidyr::complete(y = levels(y), x = levels(x)) %>%
arrange(y, x) %>%
group_by(y) %>%
mutate(
# optional: per-row sample size, if needed later for annotation
n_row = sum(replace_na(n, 0))
) %>%
ungroup()
panel_diet_age <- panel_from_tab(
TAB_DIET_AGE,
"Diet Quality × Age outcome tiles"
)
# --- 7.2 Sex: simple Female / Male factor for tile plots ---
df1$sex2 <- factor(df1$sex, levels = c("Female","Male"))
# --- 7.3 Race/ethnicity: collapse RIDRETH2/3 into a 5+ group variable ---
# Identify which RIDRETH* column is present, preferring vars$race if defined
cand_cols <- c("RIDRETH3","RIDRETH2","RIDRETH1")
if (exists("vars") && !is.null(vars$race) && !is.na(vars$race)) {
cand_cols <- c(vars$race, cand_cols)
}
rid_col <- intersect(cand_cols, names(df1))
rid_col <- if (length(rid_col)) rid_col[1] else NA_character_
# Read the numeric race code from the chosen RIDRETH* column
race_code <- rep(NA_real_, nrow(df1))
if (!is.na(rid_col)) {
race_code <- suppressWarnings(as.numeric(df1[[rid_col]]))
}
# Map NHANES race/ethnicity codes into human-readable labels
race_lbl <- dplyr::case_when(
race_code == 1 ~ "Mexican American",
race_code == 2 ~ "Other Hispanic",
race_code == 3 ~ "NH White",
race_code == 4 ~ "NH Black",
race_code == 6 ~ "NH Asian",
race_code == 7 ~ "Other/Multiracial",
TRUE ~ NA_character_
)
# Store as a factor with a consistent ordering for tiles
df1$race5 <- factor(
race_lbl,
levels = c(
"NH White","NH Black","NH Asian",
"Mexican American","Other Hispanic","Other/Multiracial"
)
)
# --- 7.4 Build diet × Age/Sex/Race tables and panels using the same machinery as Figure A ---
TAB_DIET_AGE <- make_2way(df1, "diet_q3", "age3")
TAB_DIET_SEX <- make_2way(df1, "diet_q3", "sex2")
TAB_DIET_RACE <- make_2way(df1, "diet_q3", "race5")
panel_diet_age <- panel_from_tab(TAB_DIET_AGE, "Diet Quality × Age outcome tiles")
panel_diet_sex <- panel_from_tab(TAB_DIET_SEX, "Diet Quality × Sex outcome tiles")
panel_diet_race <- panel_from_tab(TAB_DIET_RACE, "Diet Quality × Race/Ethnicity outcome tiles")
# Print the three demographic panels
print(panel_diet_age)
print(panel_diet_sex)
print(panel_diet_race)
## Selected cells from Figure A (Diet × Age): n and % with each outcome
TAB_AGE_A1C <- TAB_DIET_AGE %>%
transmute(
strat_type = "Age band",
strata = as.character(y),
diet = as.character(x),
outcome = "A1c ≥ 7%",
n = n,
n_yes = round(p_a1c7 * n),
pct = round(100 * p_a1c7, 1)
)
TAB_AGE_ACCESS <- TAB_DIET_AGE %>%
transmute(
strat_type = "Age band",
strata = as.character(y),
diet = as.character(x),
outcome = "Access (seen clinician, 12m)",
n = n,
n_yes = round(p_access * n),
pct = round(100 * p_access, 1)
)
TAB_DIET_AGE_NUM <- bind_rows(TAB_AGE_A1C, TAB_AGE_ACCESS)
print(knitr::kable(TAB_DIET_AGE_NUM))
##
##
## |strat_type |strata |diet |outcome | n| n_yes| pct|
## |:----------|:---------------|:---------|:----------------------------|---:|-----:|----:|
## |Age band |Younger (18–34) |Low diet |A1c ≥ 7% | 333| 6| 1.8|
## |Age band |Middle (35–64) |Low diet |A1c ≥ 7% | 752| 75| 10.0|
## |Age band |Older (65+) |Low diet |A1c ≥ 7% | 560| 76| 13.6|
## |Age band |Younger (18–34) |Mid diet |A1c ≥ 7% | 358| 6| 1.7|
## |Age band |Middle (35–64) |Mid diet |A1c ≥ 7% | 722| 51| 7.1|
## |Age band |Older (65+) |Mid diet |A1c ≥ 7% | 517| 44| 8.5|
## |Age band |Younger (18–34) |High diet |A1c ≥ 7% | 298| 1| 0.3|
## |Age band |Middle (35–64) |High diet |A1c ≥ 7% | 819| 69| 8.4|
## |Age band |Older (65+) |High diet |A1c ≥ 7% | 627| 46| 7.3|
## |Age band |Younger (18–34) |Low diet |Access (seen clinician, 12m) | 333| 81| 24.3|
## |Age band |Middle (35–64) |Low diet |Access (seen clinician, 12m) | 752| 136| 18.1|
## |Age band |Older (65+) |Low diet |Access (seen clinician, 12m) | 560| 40| 7.1|
## |Age band |Younger (18–34) |Mid diet |Access (seen clinician, 12m) | 358| 82| 22.9|
## |Age band |Middle (35–64) |Mid diet |Access (seen clinician, 12m) | 722| 142| 19.7|
## |Age band |Older (65+) |Mid diet |Access (seen clinician, 12m) | 517| 42| 8.1|
## |Age band |Younger (18–34) |High diet |Access (seen clinician, 12m) | 298| 77| 25.8|
## |Age band |Middle (35–64) |High diet |Access (seen clinician, 12m) | 819| 133| 16.2|
## |Age band |Older (65+) |High diet |Access (seen clinician, 12m) | 627| 56| 8.9|
## Figure A (Diet × Age): difference in prevalence (High – Low diet) within each age band
AGE_DIFF <- TAB_DIET_AGE %>%
select(y, x, p_a1c7, p_access) %>%
mutate(
a1c_pct = 100 * p_a1c7,
access_pct = 100 * p_access
) %>%
select(-p_a1c7, -p_access) %>%
pivot_longer(
cols = c(a1c_pct, access_pct),
names_to = "metric", values_to = "pct"
) %>%
filter(x %in% c("Low diet","High diet")) %>%
mutate(diet = x) %>%
select(-x) %>%
pivot_wider(names_from = diet, values_from = pct) %>%
mutate(
diff_high_minus_low = `High diet` - `Low diet`,
strat_type = "Age band",
outcome = recode(metric,
a1c_pct = "A1c ≥ 7%",
access_pct = "Access (seen clinician, 12m)")
) %>%
select(strat_type, strata = y, outcome,
`Low diet`, `High diet`, diff_high_minus_low)
print(knitr::kable(AGE_DIFF, digits = 1))
##
##
## |strat_type |strata |outcome | Low diet| High diet| diff_high_minus_low|
## |:----------|:---------------|:----------------------------|--------:|---------:|-------------------:|
## |Age band |Younger (18–34) |A1c ≥ 7% | 1.8| 0.3| -1.5|
## |Age band |Younger (18–34) |Access (seen clinician, 12m) | 24.3| 25.8| 1.5|
## |Age band |Middle (35–64) |A1c ≥ 7% | 10.0| 8.4| -1.5|
## |Age band |Middle (35–64) |Access (seen clinician, 12m) | 18.1| 16.2| -1.8|
## |Age band |Older (65+) |A1c ≥ 7% | 13.6| 7.3| -6.2|
## |Age band |Older (65+) |Access (seen clinician, 12m) | 7.1| 8.9| 1.8|
Within age bands, the tiles and difference table agree that age dominates the gradients, with diet adding only small corrections. For A1c≥7%, prevalence rises sharply from younger to older adults (e.g., 2–3% in 18–34 vs ~10–14% in 35–64 and 65+), while moving from low to high diet quality mostly nudges rates down by 1–2 percentage points, and more noticeably in the oldest group (13.6%→7.3%, –6.2 pp). Access shows the opposite pattern: younger adults have much higher visit rates overall (≈24–26%) than older adults (≈7–9%), and within each age band the low– vs high-diet differences are tiny (–1.8 to +1.8 pp), mirroring the almost flat color tiles across diet tertiles.
Summary: Putting BLI, PIR, and age together, Figure A plus these tables suggest that behavior load, income, and age are the primary axes of variation, while diet tertiles behave more like a modest modifier than a dominant driver. High-diet groups tend to have slightly better access and somewhat lower A1c/diabetes risk in specific strata (especially high-BLI and older adults), but the tiles and numbers both show that these effects are small compared with the large between-strata gaps—supporting the idea that diet quality is partly tagging broader patterns of healthcare contact and underlying risk rather than cleanly separating “healthy” and “unhealthy” cardiometabolic profiles on its own.
Panel B: Diet × Income tiles stratified by age
In Panel A we saw that income and overall behavior load largely set the baseline for diagnosis and control, with diet tertiles adding only modest corrections. One obvious follow-up is whether these patterns are stable across the life course, especially for access and preventive blood testing, where age strongly shapes insurance coverage, screening guidelines, and chances to interact with the health system.
Panel B therefore repeats the Diet × PIR tiles separately for younger (18–34), middle-aged (35–64), and older (65+) adults. Within each age band we look at the same outcomes as in Panel A, but we pay particular attention to the Access (seen clinician, 12m) and Prevention (blood test, 12m) tiles, asking whether age modifies the income gradient or the added effect of diet quality on getting into care and being screened.
make_diet_pir_tab <- function(data_subset){
make_2way(data_subset, "diet_q3", "pir3") %>%
mutate(
x = forcats::fct_relevel(x, "Low diet","Mid diet","High diet"),
y = forcats::fct_relevel(y, "Low PIR","Mid PIR","High PIR")
) %>%
tidyr::complete(y = levels(y), x = levels(x)) %>%
arrange(y, x) %>%
group_by(y) %>%
mutate(n_row = sum(replace_na(n, 0))) %>%
ungroup()
}
panel_diet_pir_by_age <- function(age_label){
dsub <- df1 %>% filter(age3 == age_label)
tab <- make_diet_pir_tab(dsub)
panel_from_tab(tab, paste0("Diet Quality × Income (PIR) outcome tiles — ", age_label))
}
p_age_18_34 <- panel_diet_pir_by_age("Younger (18–34)")
p_age_35_64 <- panel_diet_pir_by_age("Middle (35–64)")
p_age_65p <- panel_diet_pir_by_age("Older (65+)")
# Optionally print:
print(p_age_18_34)
print(p_age_35_64)
print(p_age_65p)
Panel B: what we learn
Across all three age groups, the tiles confirm that income still structures access and prevention more than diet does, but the gradient changes with age. Among younger adults, access and blood testing are relatively low and noisy, with only mild PIR differences and no consistent advantage for high-diet groups, suggesting that many young adults remain outside regular care regardless of diet. In middle-aged adults, the income gradient becomes clearer: low-PIR groups show higher “access” but lower preventive testing than high-PIR peers, hinting that lower-income patients may visit clinicians for acute needs while higher-income peers are more likely to receive guideline-driven screening. By older ages, access is fairly high in all PIR bands and prevention tiles are uniformly dark, with high-PIR/high-diet cells still near the top—consistent with Medicare and age-based screening policies reducing, but not erasing, socioeconomic gaps.
Together with Panel A, Panel B suggests that structural factors (age, insurance, screening norms, income) shape who gets seen and tested, while diet quality operates more as a mild modifier within those strata rather than a primary driver of access or prevention.
Panel C Ridge plots: distributions of SBP and A1c Panels A and B focus on whether people are diagnosed, seen, or “uncontrolled” according to guideline cutoffs. But those binary outcomes sit on top of continuous risk distributions: some groups may have similar proportions of A1c ≥ 7%, for example, while still differing in their average A1c or in how heavy the high-risk tail is.
In Panel C, I therefore switch to ridge plots of SBP and A1c, stratified by the same social and behavioral axes as before (diet tertiles crossed with income or age). The question is:
Do higher diet quality scores shift the entire distribution of SBP or A1c downward within income and age strata, or do we mostly see small changes around the clinical thresholds (140/90 for BP, 7% for A1c)?
Are diet-related shifts comparable to those associated with income or age, or do the structural factors still dominate once we look at the continuous measurements?
I therefore, in Panel C, add ridge plots to visualize:
This helps distinguish whether differences are simple shifts or more complex distribution changes.
to_num <- function(x) suppressWarnings(as.numeric(x))
# If diet_ter is not already present, build tertiles of diet_score
if (!("diet_ter" %in% names(df1))) {
if (!("diet_score" %in% names(df1))) stop("diet_score missing.")
df1 <- df1 %>%
mutate(
diet_ter = ntile(diet_score, 3),
diet_ter = factor(diet_ter,
labels = c("Low diet score","Mid diet score","High diet score"))
)
}
# Ridge-plot backbone: SBP, A1c, BLI quintiles, and diet tertiles
ridg_dat <- df1 %>%
transmute(
sbp = to_num(sbp),
a1c = to_num(a1c),
bli_q = factor(bli_q,
levels = c("BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)")),
diet_ter = factor(diet_ter,
levels = c("Low diet score","Mid diet score","High diet score"))
) %>%
mutate(
# Truncate to plausible clinical ranges for cleaner densities
sbp = ifelse(sbp >= 80 & sbp <= 200, sbp, NA_real_),
a1c = ifelse(a1c >= 4 & a1c <= 14, a1c, NA_real_)
)
Next I construct four core ridge plots:
SBP by BLI quintile, A1c by BLI quintile, SBP by diet tertile, A1c by diet tertile. Each plot overlays a vertical guideline at a standard clinical threshold.
p_sbp_bli <- ggplot(
filter(ridg_dat, !is.na(sbp), !is.na(bli_q)),
aes(x = sbp, y = bli_q, fill = bli_q)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.85, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#e0e7ff","#c7d2fe","#a5b4fc","#818cf8","#6366f1"),
guide = "none"
) +
labs(
title = "SBP distributions by Behavior Load (BLI)",
subtitle = "Dashed line at 140 mmHg",
x = "Systolic blood pressure (mmHg)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
p_a1c_bli <- ggplot(
filter(ridg_dat, !is.na(a1c), !is.na(bli_q)),
aes(x = a1c, y = bli_q, fill = bli_q)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.85, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#e0f2fe","#bae6fd","#7dd3fc","#38bdf8","#0ea5e9"),
guide = "none"
) +
labs(
title = "A1c distributions by Behavior Load (BLI)",
subtitle = "Dashed line at 7%",
x = "Hemoglobin A1c (%)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
p_sbp_diet <- ggplot(
filter(ridg_dat, !is.na(sbp), !is.na(diet_ter)),
aes(x = sbp, y = diet_ter, fill = diet_ter)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.9, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#fee2e2","#fecaca","#fca5a5"),
guide = "none"
) +
labs(
title = "SBP distributions by Diet quality (tertiles)",
subtitle = "Dashed line at 140 mmHg",
x = "Systolic blood pressure (mmHg)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
p_a1c_diet <- ggplot(
filter(ridg_dat, !is.na(a1c), !is.na(diet_ter)),
aes(x = a1c, y = diet_ter, fill = diet_ter)
) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.9, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#dcfce7","#bbf7d0","#86efac"),
guide = "none"
) +
labs(
title = "A1c distributions by Diet quality (tertiles)",
subtitle = "Dashed line at 7%",
x = "Hemoglobin A1c (%)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
# 2×2 layout: BLI gradients on top, diet tertiles on bottom
p_ridge <- (p_sbp_bli | p_a1c_bli) / (p_sbp_diet | p_a1c_diet)
Finally, I refine factor ordering and define helper functions to build within–BLI stratum ridge panels. These panels show how SBP and A1c vary across diet tertiles within a fixed BLI quintile, making it easier to compare diet effects at a given behavior load.
# Ensure consistent ordering for BLI quintiles and diet tertiles
ridg_dat$bli_q <- fct_relevel(
ridg_dat$bli_q,
"BLI Q1 (lowest)","Q2","Q3","Q4","BLI Q5 (highest)"
)
ridg_dat$diet_ter <- fct_relevel(
ridg_dat$diet_ter,
"Low diet score","Mid diet score","High diet score"
)
# Small builders: SBP and A1c ridges by diet tertile, with flexible titles
build_sbp_plot <- function(dat, title_suffix = "") {
ggplot(dat, aes(x = sbp, y = diet_ter, fill = diet_ter)) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.90, color = "grey25"
) +
geom_vline(xintercept = 140, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#fee2e2","#fecaca","#fca5a5"),
guide = "none"
) +
labs(
title = paste0("SBP distributions by Diet quality", title_suffix),
subtitle = "Dashed line at 140 mmHg",
x = "Systolic blood pressure (mmHg)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
}
build_a1c_plot <- function(dat, title_suffix = "") {
ggplot(dat, aes(x = a1c, y = diet_ter, fill = diet_ter)) +
ggridges::geom_density_ridges(
scale = 1.2, rel_min_height = 0.01,
alpha = 0.90, color = "grey25"
) +
geom_vline(xintercept = 7, linetype = 2, linewidth = 0.7, color = "firebrick") +
scale_fill_manual(
values = c("#dcfce7","#bbf7d0","#86efac"),
guide = "none"
) +
labs(
title = paste0("A1c distributions by Diet quality", title_suffix),
subtitle = "Dashed line at 7%",
x = "Hemoglobin A1c (%)",
y = NULL
) +
theme_ridges(font_size = 12, grid = TRUE)
}
# For a given BLI quintile, build a 2-panel ridge figure:
# left = SBP × diet tertile
# right = A1c × diet tertile
make_bli_panel <- function(bli_label) {
sbp_sub <- ridg_dat %>%
filter(bli_q == bli_label, !is.na(diet_ter), !is.na(sbp))
a1c_sub <- ridg_dat %>%
filter(bli_q == bli_label, !is.na(diet_ter), !is.na(a1c))
left <- if (nrow(sbp_sub)) {
build_sbp_plot(
sbp_sub,
paste0(" (", bli_label, "; n=", nrow(sbp_sub), ")")
)
} else NULL
right <- if (nrow(a1c_sub)) {
build_a1c_plot(
a1c_sub,
paste0(" (", bli_label, "; n=", nrow(a1c_sub), ")")
)
} else NULL
fig <- if (!is.null(left) && !is.null(right)) {
left | right
} else if (!is.null(left)) {
left
} else if (!is.null(right)) {
right
} else {
ggplot() + theme_void() +
ggtitle(paste0(bli_label, " — no data"))
}
fig + plot_annotation(
title = paste0("Diet × Clinical distributions — ", bli_label),
theme = theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
)
)
}
# Precompute a panel for each BLI quintile (can print selectively)
panel_bli_q1 <- make_bli_panel("BLI Q1 (lowest)")
panel_bli_q2 <- make_bli_panel("Q2")
panel_bli_q3 <- make_bli_panel("Q3")
panel_bli_q4 <- make_bli_panel("Q4")
panel_bli_q5 <- make_bli_panel("BLI Q5 (highest)")
print(panel_bli_q1)
## Picking joint bandwidth of 4.39
## Picking joint bandwidth of 0.114
I also mirror this conditioning for PIR tertiles, creating SBP/A1c ridge plots by diet quality within income bands.
to_num <- function(x) suppressWarnings(as.numeric(x))
if (!("diet_ter" %in% names(df1))) {
stop("diet_ter is missing; build it from diet_score before running.")
}
df1$diet_ter <- fct_relevel(df1$diet_ter,
"Low diet score","Mid diet score","High diet score")
if (!("pir3" %in% names(df1)) || all(is.na(df1$pir3))) {
if (!("pir" %in% names(df1))) stop("Need numeric PIR to build pir3.")
qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$pir3 <- cut(df1$pir, qs_pir, include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR"))
}
df1$pir3 <- fct_relevel(df1$pir3, "Low PIR","Mid PIR","High PIR")
ridg_income <- df1 %>%
transmute(
sbp = to_num(sbp),
a1c = to_num(a1c),
pir3 = pir3,
diet_ter = diet_ter
) %>%
mutate(
sbp = ifelse(sbp >= 80 & sbp <= 200, sbp, NA_real_),
a1c = ifelse(a1c >= 4 & a1c <= 14, a1c, NA_real_)
)
build_sbp_plot <- function(dat, title_suffix="") {
ggplot(dat, aes(x=sbp, y=diet_ter, fill=diet_ter)) +
ggridges::geom_density_ridges(scale=1.2, rel_min_height=0.01,
alpha=0.90, color="grey25") +
geom_vline(xintercept=140, linetype=2, linewidth=0.7, color="firebrick") +
scale_fill_manual(values=c("#fee2e2","#fecaca","#fca5a5"), guide="none") +
labs(title=paste0("SBP distributions by Diet quality", title_suffix),
subtitle="Dashed line at 140 mmHg",
x="Systolic blood pressure (mmHg)", y=NULL) +
theme_ridges(font_size=12, grid=TRUE)
}
build_a1c_plot <- function(dat, title_suffix="") {
ggplot(dat, aes(x=a1c, y=diet_ter, fill=diet_ter)) +
ggridges::geom_density_ridges(scale=1.2, rel_min_height=0.01,
alpha=0.90, color="grey25") +
geom_vline(xintercept=7, linetype=2, linewidth=0.7, color="firebrick") +
scale_fill_manual(values=c("#dcfce7","#bbf7d0","#86efac"), guide="none") +
labs(title=paste0("A1c distributions by Diet quality", title_suffix),
subtitle="Dashed line at 7%",
x="Hemoglobin A1c (%)", y=NULL) +
theme_ridges(font_size=12, grid=TRUE)
}
make_pir_panel <- function(pir_label) {
sbp_sub <- ridg_income %>% filter(pir3 == pir_label, !is.na(diet_ter), !is.na(sbp))
a1c_sub <- ridg_income %>% filter(pir3 == pir_label, !is.na(diet_ter), !is.na(a1c))
left <- if (nrow(sbp_sub))
build_sbp_plot(sbp_sub, paste0(" (", pir_label, "; n=", nrow(sbp_sub), ")")) else NULL
right <- if (nrow(a1c_sub))
build_a1c_plot(a1c_sub, paste0(" (", pir_label, "; n=", nrow(a1c_sub), ")")) else NULL
fig <- if (!is.null(left) && !is.null(right)) left | right
else if (!is.null(left)) left
else if (!is.null(right)) right
else ggplot() + theme_void() + ggtitle(paste0(pir_label, " — no data"))
fig + plot_annotation(
title = paste0("Diet × Clinical distributions — ", pir_label),
theme = theme(plot.title = element_text(size=14, face="bold", hjust=0.5))
)
}
panel_pir_low <- make_pir_panel("Low PIR")
panel_pir_mid <- make_pir_panel("Mid PIR")
panel_pir_high <- make_pir_panel("High PIR")
print(panel_pir_low)
## Picking joint bandwidth of 4.47
## Picking joint bandwidth of 0.133
print(panel_pir_mid)
## Picking joint bandwidth of 4.36
## Picking joint bandwidth of 0.122
print(panel_pir_high)
## Picking joint bandwidth of 3.89
## Picking joint bandwidth of 0.1
Numerical analysis: To make these visual impressions more concrete, I pair each set of ridges with simple numeric summaries: group-wise means and standard deviations, plus the percent of people above clinically relevant cutoffs (A1c ≥ 6% and ≥ 7%; SBP ≥ 140 or DBP ≥ 90). These tables give us a way to quantify whether, for example, “leaner” ridges in high-diet groups really correspond to meaningfully lower risk, or whether the distributions mostly overlap.
# Helper: summarise A1c within arbitrary grouping structure
summ_a1c <- function(dat, group_vars,
cut_lo = 6, cut_hi = 7) {
dat %>%
dplyr::filter(!is.na(a1c)) %>%
dplyr::group_by(dplyr::across(all_of(group_vars))) %>%
dplyr::summarise(
n = dplyr::n(),
mean_a1c = mean(a1c),
sd_a1c = sd(a1c),
pct_ge_lo = 100 * mean(a1c >= cut_lo),
pct_ge_hi = 100 * mean(a1c >= cut_hi),
.groups = "drop"
)
}
# Helper: summarise BP within arbitrary grouping structure
summ_bp <- function(dat, group_vars,
sbp_cut = 140, dbp_cut = 90) {
dat %>%
dplyr::filter(!is.na(sbp), !is.na(dbp)) %>%
dplyr::group_by(dplyr::across(all_of(group_vars))) %>%
dplyr::summarise(
n = dplyr::n(),
mean_sbp = mean(sbp),
sd_sbp = sd(sbp),
mean_dbp = mean(dbp),
sd_dbp = sd(dbp),
pct_unctrl = 100 * mean(sbp >= sbp_cut | dbp >= dbp_cut),
.groups = "drop"
)
}
# ---- Match the ridge-plot facets ----
# (Adjust group_vars to align with the stratification you used in Panel C.)
# Example 1: Diet × Income (PIR) ridges
a1c_pir_diet <- summ_a1c(df1, c("pir_grp", "diet_ter"))
bp_pir_diet <- summ_bp(df1, c("pir_grp", "diet_ter"))
# Example 2: Diet × Age ridges
a1c_age_diet <- summ_a1c(df1, c("age3", "diet_ter"))
bp_age_diet <- summ_bp(df1, c("age3", "diet_ter"))
# If you have ridges stratified by behavioral load index (BLI), you can also do:
# a1c_bli_diet <- summ_a1c(df1, c("bli_quint", "diet_ter"))
# bp_bli_diet <- summ_bp(df1, c("bli_quint", "diet_ter"))
# Optionally, print as compact tables in the notebook
knitr::kable(a1c_pir_diet, digits = 1,
caption = "A1c distribution by PIR × diet tertile")
| pir_grp | diet_ter | n | mean_a1c | sd_a1c | pct_ge_lo | pct_ge_hi |
|---|---|---|---|---|---|---|
| Low PIR | Low diet score | 463 | 5.9 | 1.4 | 25.3 | 11.4 |
| Low PIR | Mid diet score | 423 | 5.8 | 1.3 | 22.2 | 6.9 |
| Low PIR | High diet score | 416 | 5.9 | 1.2 | 26.4 | 11.3 |
| Low PIR | NA | 458 | 5.7 | 1.1 | 19.4 | 8.3 |
| Mid PIR | Low diet score | 556 | 5.8 | 1.0 | 26.1 | 10.6 |
| Mid PIR | Mid diet score | 516 | 5.7 | 1.1 | 19.6 | 6.0 |
| Mid PIR | High diet score | 485 | 5.7 | 0.9 | 20.2 | 6.2 |
| Mid PIR | NA | 416 | 5.6 | 1.1 | 18.0 | 5.5 |
| High PIR | Low diet score | 530 | 5.7 | 0.9 | 20.0 | 6.0 |
| High PIR | Mid diet score | 563 | 5.6 | 0.9 | 13.5 | 4.6 |
| High PIR | High diet score | 705 | 5.5 | 0.7 | 12.9 | 3.1 |
| High PIR | NA | 342 | 5.6 | 1.0 | 16.4 | 7.9 |
| NA | Low diet score | 189 | 5.7 | 0.8 | 19.0 | 6.9 |
| NA | Mid diet score | 190 | 5.8 | 1.0 | 22.1 | 7.9 |
| NA | High diet score | 212 | 5.8 | 1.1 | 22.6 | 8.0 |
| NA | NA | 251 | 5.8 | 1.2 | 21.5 | 8.4 |
knitr::kable(bp_pir_diet, digits = 1,
caption = "BP distribution by PIR × diet tertile")
| pir_grp | diet_ter | n | mean_sbp | sd_sbp | mean_dbp | sd_dbp | pct_unctrl |
|---|---|---|---|---|---|---|---|
| Low PIR | Low diet score | 535 | 119.1 | 18.8 | 72.8 | 12.2 | 15.5 |
| Low PIR | Mid diet score | 488 | 117.8 | 19.8 | 71.5 | 12.6 | 16.0 |
| Low PIR | High diet score | 459 | 120.2 | 20.5 | 73.1 | 12.1 | 20.3 |
| Low PIR | NA | 555 | 116.2 | 19.4 | 70.6 | 12.4 | 12.1 |
| Mid PIR | Low diet score | 627 | 120.4 | 18.0 | 72.6 | 11.2 | 17.5 |
| Mid PIR | Mid diet score | 586 | 119.5 | 18.5 | 72.0 | 11.9 | 16.0 |
| Mid PIR | High diet score | 530 | 120.3 | 18.4 | 72.2 | 11.1 | 15.1 |
| Mid PIR | NA | 475 | 118.0 | 17.4 | 71.9 | 11.7 | 12.6 |
| High PIR | Low diet score | 576 | 119.7 | 16.2 | 73.7 | 10.9 | 14.1 |
| High PIR | Mid diet score | 618 | 119.5 | 16.4 | 73.4 | 10.7 | 12.8 |
| High PIR | High diet score | 740 | 119.7 | 16.9 | 72.1 | 10.0 | 12.3 |
| High PIR | NA | 381 | 117.7 | 16.9 | 71.8 | 11.7 | 12.9 |
| NA | Low diet score | 215 | 119.1 | 18.9 | 71.5 | 11.5 | 15.8 |
| NA | Mid diet score | 217 | 118.8 | 18.5 | 71.6 | 10.5 | 17.1 |
| NA | High diet score | 229 | 119.9 | 17.6 | 71.8 | 10.7 | 15.3 |
| NA | NA | 287 | 119.0 | 19.1 | 70.6 | 11.8 | 13.6 |
knitr::kable(a1c_age_diet, digits = 1,
caption = "A1c distribution by age band × diet tertile")
| age3 | diet_ter | n | mean_a1c | sd_a1c | pct_ge_lo | pct_ge_hi |
|---|---|---|---|---|---|---|
| Younger (18–34) | Low diet score | 299 | 5.3 | 0.6 | 3.3 | 2.0 |
| Younger (18–34) | Mid diet score | 330 | 5.3 | 1.0 | 3.3 | 1.8 |
| Younger (18–34) | High diet score | 284 | 5.2 | 0.3 | 2.1 | 0.4 |
| Younger (18–34) | NA | 370 | 5.3 | 0.7 | 5.1 | 1.4 |
| Middle (35–64) | Low diet score | 727 | 5.9 | 1.3 | 26.0 | 10.3 |
| Middle (35–64) | Mid diet score | 697 | 5.8 | 1.2 | 21.7 | 7.3 |
| Middle (35–64) | High diet score | 793 | 5.7 | 1.1 | 19.7 | 8.7 |
| Middle (35–64) | NA | 594 | 5.8 | 1.2 | 22.6 | 8.4 |
| Older (65+) | Low diet score | 536 | 6.1 | 1.0 | 38.2 | 14.2 |
| Older (65+) | Mid diet score | 489 | 6.0 | 1.1 | 30.5 | 9.0 |
| Older (65+) | High diet score | 604 | 5.9 | 0.9 | 30.3 | 7.6 |
| Older (65+) | NA | 279 | 6.2 | 1.2 | 43.4 | 19.4 |
| NA | Low diet score | 176 | 5.3 | 0.3 | 0.0 | 0.0 |
| NA | Mid diet score | 176 | 5.2 | 0.3 | 1.1 | 0.0 |
| NA | High diet score | 137 | 5.2 | 0.3 | 1.5 | 0.0 |
| NA | NA | 224 | 5.2 | 0.3 | 0.0 | 0.0 |
knitr::kable(bp_age_diet, digits = 1,
caption = "BP distribution by age band × diet tertile")
| age3 | diet_ter | n | mean_sbp | sd_sbp | mean_dbp | sd_dbp | pct_unctrl |
|---|---|---|---|---|---|---|---|
| Younger (18–34) | Low diet score | 323 | 112.6 | 12.0 | 72.0 | 10.0 | 4.6 |
| Younger (18–34) | Mid diet score | 346 | 113.3 | 12.2 | 72.1 | 10.7 | 6.6 |
| Younger (18–34) | High diet score | 291 | 110.8 | 11.7 | 70.4 | 9.9 | 4.1 |
| Younger (18–34) | NA | 384 | 112.8 | 11.7 | 70.7 | 9.5 | 4.7 |
| Middle (35–64) | Low diet score | 738 | 122.2 | 15.7 | 77.9 | 10.7 | 17.2 |
| Middle (35–64) | Mid diet score | 702 | 121.8 | 17.0 | 77.6 | 10.9 | 16.5 |
| Middle (35–64) | High diet score | 806 | 120.8 | 16.8 | 75.7 | 10.5 | 14.6 |
| Middle (35–64) | NA | 595 | 122.7 | 16.5 | 77.9 | 10.9 | 17.3 |
| Older (65+) | Low diet score | 547 | 129.5 | 19.8 | 72.4 | 10.9 | 29.8 |
| Older (65+) | Mid diet score | 502 | 130.0 | 19.2 | 72.5 | 10.5 | 29.7 |
| Older (65+) | High diet score | 614 | 129.5 | 19.1 | 72.9 | 10.6 | 27.2 |
| Older (65+) | NA | 275 | 133.5 | 22.6 | 71.9 | 13.3 | 34.2 |
| NA | Low diet score | 345 | 105.3 | 9.4 | 63.5 | 8.3 | 0.9 |
| NA | Mid diet score | 359 | 103.6 | 9.4 | 61.7 | 7.3 | 0.0 |
| NA | High diet score | 247 | 104.7 | 9.6 | 62.5 | 7.4 | 0.8 |
| NA | NA | 444 | 104.6 | 10.0 | 62.4 | 7.8 | 0.0 |
Summary: Across all ridge plots, the big shifts in risk come from income and age, not diet tertiles. Within each PIR band, A1c and BP means are almost identical across low/mid/high diet, and the share above A1c ≥7% or uncontrolled BP changes by only a few percentage points, confirming that diet tiles in Panel A were mostly showing differential detection rather than large differences in underlying risk. By contrast, moving from younger → middle-aged → older adults shifts the whole A1c and SBP distributions rightward and roughly quintuples the proportion above clinical cutpoints, with only modest leftward nudges for high-diet groups in middle-aged and older adults. Taken together, Panel C suggests that in this cross-section, diet quality is at best a secondary modifier of cardiometabolic risk distributions once we account for age and income.
Panel D Pathway plots: Diagnosis → Treatment → Control of Diabetic Conditions
Up to this point, Panels A–C have told us two things: (1) income, age, and behavior load strongly shape who shows up with elevated A1c or BP, while diet tertiles mainly add small, non-monotone tweaks; and (2) many of the strongest gradients we saw were in access and testing, not necessarily in the underlying risk distributions themselves.
In Panel D we therefore shift from static prevalence to the care pathway: among adults with hypertension or diabetes, where exactly do disparities appear—at diagnosis, treatment uptake, or successful control? We stratify by income and diet (and then by age and diet) to ask whether “better” diet tertiles are associated with fewer people being undiagnosed, greater medication use once diagnosed, or higher rates of control conditional on diagnosis. This helps us see whether diet quality in this cross-section acts more like a marker of health-system engagement (better screening and treatment) or a modifier of biological response once on therapy, and how those patterns differ across income and age bands.
To understand the care pathway I summarize, for diabetes especially:
I first set up common helpers and derive consistent diagnosis, medication, and control flags, which will be reused for both income‐ and age‐stratified plots.
yn2 <- function(x) ifelse(as.numeric(as.character(x)) %in% 1, 1L,
ifelse(as.numeric(as.character(x)) %in% 2, 0L, NA_integer_))
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
if (!("pir_grp" %in% names(df1)) || all(is.na(df1$pir_grp))) {
qs_pir <- quantile(df1$pir, c(0, 1/3, 2/3, 1), na.rm = TRUE)
df1$pir_grp <- cut(df1$pir, qs_pir, include.lowest = TRUE,
labels = c("Low PIR","Mid PIR","High PIR"))
}
df1$pir_grp <- forcats::fct_relevel(df1$pir_grp, "Low PIR","Mid PIR","High PIR")
stopifnot("diet_score" %in% names(df1))
if (!("diet_ter" %in% names(df1))) {
df1 <- df1 %>%
dplyr::mutate(
diet_ter = dplyr::ntile(diet_score, 3),
diet_ter = factor(diet_ter,
labels = c("Low diet score","Mid diet score","High diet score"))
)
} else {
df1$diet_ter <- forcats::fct_relevel(df1$diet_ter,
"Low diet score","Mid diet score","High diet score")
}
df1 <- df1 %>%
dplyr::mutate(
dx_htn = if ("BPQ020" %in% names(.)) yn2(BPQ020) else NA_integer_,
dx_dm = dplyr::case_when(
"DIQ010" %in% names(.) ~ ifelse(DIQ010 %in% 1, 1L,
ifelse(DIQ010 %in% c(2,3), 0L, NA_integer_)),
TRUE ~ NA_integer_
),
htn_meds = if ("BPQ050A" %in% names(.)) yn2(BPQ050A) else NA_integer_,
dm_ins = if ("DIQ050" %in% names(.)) yn2(DIQ050) else NA_integer_,
dm_pill = if ("DIQ070" %in% names(.)) yn2(DIQ070) else NA_integer_,
dm_meds = as.integer(dplyr::coalesce(dm_ins,0L) | dplyr::coalesce(dm_pill,0L)),
sbp = to_num(sbp), dbp = to_num(dbp), a1c = to_num(a1c),
htn_ctrl = ifelse(dx_htn == 1L & !is.na(sbp) & !is.na(dbp),
as.integer(sbp < 140 & dbp < 90), NA_integer_),
dm_ctrl = ifelse(dx_dm == 1L & !is.na(a1c),
as.integer(a1c < 7), NA_integer_)
)
df_use <- df1 %>% dplyr::filter(!is.na(pir_grp))
Next, I define a generic pathway summarizer that computes, for each PIR × diet stratum:
how many adults have usable data (n_any),
how many are diagnosed, on meds, and controlled,
and the corresponding percentages for each stage of the pathway.
summ_pathway <- function(dat, x_group = "diet_ter",
dx_var, meds_var, ctrl_var, outcome_label) {
xsym <- rlang::sym(x_group)
base_any <- dat %>%
dplyr::group_by(pir_grp, !!xsym) %>%
dplyr::summarise(
n_any = sum(!is.na(.data[[dx_var]])),
n_total = dplyr::n(),
n_dx_any= sum(.data[[dx_var]] == 1L, na.rm = TRUE),
.groups = "drop"
)
base_dx <- dat %>%
dplyr::filter(.data[[dx_var]] == 1L) %>%
dplyr::group_by(pir_grp, !!xsym) %>%
dplyr::summarise(
n_dx = dplyr::n(),
n_meds = sum(dplyr::coalesce(.data[[meds_var]],0L), na.rm = TRUE),
n_ctrl = sum(dplyr::coalesce(.data[[ctrl_var]],0L), na.rm = TRUE),
.groups = "drop"
)
joined <- base_any %>%
dplyr::left_join(base_dx, by = c("pir_grp", x_group)) %>%
dplyr::mutate(
n_dx = dplyr::coalesce(n_dx, 0L),
n_meds = dplyr::coalesce(n_meds, 0L),
n_ctrl = dplyr::coalesce(n_ctrl, 0L),
pct_dx = 100 * (n_dx_any / pmax(n_any, 1)),
pct_meds = 100 * (n_meds / pmax(n_dx, 1)),
pct_ctrl = 100 * (n_ctrl / pmax(n_dx, 1))
)
long <- joined %>%
tidyr::pivot_longer(
cols = c(pct_ctrl, pct_meds, pct_dx),
names_to = "stage", values_to = "pct"
) %>%
dplyr::mutate(
stage = factor(stage,
levels = c("pct_ctrl","pct_meds","pct_dx"),
labels = c("Controlled (among diagnosed)",
"On meds (among diagnosed)",
"Diagnosed (among all)")),
denom = dplyr::case_when(
stage == "Diagnosed (among all)" ~ n_any,
stage == "On meds (among diagnosed)" ~ n_dx,
stage == "Controlled (among diagnosed)" ~ n_dx
),
outcome = outcome_label
) %>%
dplyr::select(pir_grp, !!xsym, stage, pct, denom, outcome)
long
}
I then apply this pathway summary separately for hypertension and diabetes, and bind them for plotting. The first figure shows the 3-stage pathway by PIR and diet tertile; the second zooms into diabetes control by diet × PIR, split by medication status and making denominators explicit.
htn_path_diet <- summ_pathway(df_use, "diet_ter", "dx_htn","htn_meds","htn_ctrl","Hypertension")
dm_path_diet <- summ_pathway(df_use, "diet_ter", "dx_dm", "dm_meds","dm_ctrl","Diabetes")
path_diet_all <- dplyr::bind_rows(htn_path_diet, dm_path_diet)
fig_path_diet <- ggplot2::ggplot(path_diet_all,
ggplot2::aes(x = stage, y = pct, fill = diet_ter)) +
ggplot2::geom_col(position = ggplot2::position_dodge(width = 0.75), width = 0.7) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf("%.0f%%\n(n=%s)", pct, scales::comma(denom))),
position = ggplot2::position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3
) +
ggplot2::facet_grid(outcome ~ pir_grp, scales = "free_y") +
ggplot2::scale_fill_brewer(palette = "Blues", direction = -1, name = "Diet quality") +
ggplot2::labs(
title = "Diagnosis → Treatment → Control pathways by income and diet quality",
x = NULL, y = "% of adults"
) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 10, hjust = 1),
plot.title = ggplot2::element_text(face = "bold")
)
dm_ctrl_by_diet_pir <- df_use %>%
dplyr::filter(dx_dm == 1L) %>%
dplyr::mutate(med_group = ifelse(dm_meds == 1L, "Medicated", "Unmedicated")) %>%
dplyr::group_by(pir_grp, diet_ter, med_group) %>%
dplyr::summarise(
n_dx = dplyr::n(),
n_ctrl = sum(dplyr::coalesce(dm_ctrl,0L), na.rm = TRUE),
pct_ctrl = 100 * n_ctrl / pmax(n_dx, 1),
.groups = "drop"
)
fig_dm_ctrl_diet <- ggplot2::ggplot(dm_ctrl_by_diet_pir,
ggplot2::aes(x = diet_ter, y = pct_ctrl, fill = med_group)) +
ggplot2::geom_col(position = ggplot2::position_dodge(width = 0.75), width = 0.65) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf("%.0f%%\n(n=%s)", pct_ctrl, scales::comma(n_dx))),
position = ggplot2::position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3
) +
ggplot2::scale_fill_brewer(palette = "Blues", direction = -1, name = NULL) +
ggplot2::facet_wrap(~ pir_grp, nrow = 1) +
ggplot2::labs(
title = "Diabetes control (A1c < 7%) among diagnosed by Diet × Income, split by medication status",
x = "Diet quality (tertiles)", y = "% controlled"
) +
ggplot2::theme_minimal(base_size = 12) +
ggplot2::theme(
panel.grid.major.x = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 10, hjust = 1),
plot.title = ggplot2::element_text(face = "bold")
)
fig_path_diet
fig_dm_ctrl_diet
Pathway plots by Age × Diet
Analogously, I build diagnosis–treatment–control pathways by age group and diet tertile, and then look specifically at diabetes control by diet and age, split by medication status.
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
yn2 <- function(x) {
z <- to_num(x)
ifelse(z %in% 1, 1L, ifelse(z %in% 2, 0L, NA_integer_))
}
if (!("age3" %in% names(df1))) {
df1 <- df1 %>%
mutate(
RIDAGEYR = if ("RIDAGEYR" %in% names(.)) to_num(RIDAGEYR) else to_num(age),
age3 = case_when(
!is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65 ~ "Older (65+)"
),
age3 = factor(age3, levels = c("Younger (18–34)","Middle (35–64)","Older (65+)"))
)
}
if (!("diet_ter" %in% names(df1))) {
stop("diet_ter is missing. Create tertiles from diet_score before proceeding.")
} else {
df1$diet_ter <- forcats::fct_relevel(df1$diet_ter,
"Low diet score","Mid diet score","High diet score")
}
df1 <- df1 %>%
mutate(
sbp = if ("sbp" %in% names(.)) to_num(sbp) else NA_real_,
dbp = if ("dbp" %in% names(.)) to_num(dbp) else NA_real_,
a1c = if ("a1c" %in% names(.)) to_num(a1c) else NA_real_,
dx_htn = if ("dx_htn" %in% names(.)) as.integer(dx_htn) else NA_integer_,
htn_meds = if ("htn_meds" %in% names(.)) as.integer(htn_meds) else NA_integer_,
htn_ctrl = if ("htn_ctrl" %in% names(.)) as.integer(htn_ctrl) else NA_integer_,
dx_dm = if ("dx_dm" %in% names(.)) as.integer(dx_dm) else NA_integer_,
dm_meds = if ("dm_meds" %in% names(.)) as.integer(dm_meds) else NA_integer_,
dm_ctrl = if ("dm_ctrl" %in% names(.)) as.integer(dm_ctrl) else NA_integer_
)
df_age <- df1 %>%
filter(!is.na(age3), !is.na(diet_ter)) %>%
select(age3, diet_ter,
dx_htn, htn_meds, htn_ctrl,
dx_dm, dm_meds, dm_ctrl,
sbp, dbp, a1c)
I reuse the same pathway logic, now grouped by age3 × diet_ter. The function structure mirrors the PIR-based version, but with age as the main stratifier.
summ_pathway <- function(dat, dx_var, meds_var, ctrl_var, outcome_label){
den_any <- dat %>%
dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(n_any = sum(!is.na(.data[[dx_var]])), .groups="drop")
den_dx <- dat %>%
dplyr::filter(.data[[dx_var]] == 1L) %>%
dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(
n_dx = dplyr::n(),
n_meds = sum(coalesce(.data[[meds_var]],0L), na.rm = TRUE),
n_ctrl = sum(coalesce(.data[[ctrl_var]],0L), na.rm = TRUE),
.groups = "drop"
)
out <- den_any %>%
dplyr::left_join(
dat %>% dplyr::group_by(age3, diet_ter) %>%
dplyr::summarise(n_dx_any = sum(.data[[dx_var]]==1L, na.rm=TRUE), .groups="drop"),
by = c("age3","diet_ter")
) %>%
dplyr::left_join(den_dx, by = c("age3","diet_ter")) %>%
dplyr::mutate(
pct_dx = 100 * (n_dx_any / pmax(n_any, 1)),
pct_meds = 100 * (n_meds / pmax(n_dx, 1)),
pct_ctrl = 100 * (n_ctrl / pmax(n_dx, 1))
) %>%
dplyr::select(age3, diet_ter, n_any, n_dx, pct_dx, pct_meds, pct_ctrl) %>%
tidyr::pivot_longer(cols = starts_with("pct_"),
names_to = "stage", values_to = "pct") %>%
dplyr::mutate(
stage = factor(stage,
levels=c("pct_ctrl","pct_meds","pct_dx"),
labels=c("Controlled (among diagnosed)",
"On meds (among diagnosed)",
"Diagnosed (among all)")),
denom = dplyr::case_when(
stage == "Controlled (among diagnosed)" ~ n_dx,
stage == "On meds (among diagnosed)" ~ n_dx,
stage == "Diagnosed (among all)" ~ n_any
),
outcome = outcome_label
) %>%
dplyr::select(age3, diet_ter, stage, pct, denom, outcome)
out
}
Finally, I generate the age × diet pathway figure and the diabetes control figure, again annotating bars with both percentages and denominators to make interpretation transparent.
path_htn_age <- summ_pathway(df_age, "dx_htn", "htn_meds", "htn_ctrl", "Hypertension")
path_dm_age <- summ_pathway(df_age, "dx_dm", "dm_meds", "dm_ctrl", "Diabetes")
path_all_age <- dplyr::bind_rows(path_htn_age, path_dm_age)
diet_pal <- c("Low diet score"="#CBD5E1","Mid diet score"="#94A3B8","High diet score"="#475569")
p_path_diet_age <- ggplot(path_all_age,
aes(x = stage, y = pct, fill = diet_ter)) +
geom_col(position = position_dodge(width = 0.75), width = 0.7) +
geom_text(aes(label = sprintf("%.0f%%\n(n=%s)", pct, scales::comma(denom))),
position = position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3) +
facet_grid(outcome ~ age3, scales = "free_y") +
scale_fill_manual(values = diet_pal, name = "Diet quality (tertiles)") +
labs(title = "Diagnosis → Treatment → Control pathways by Age and Diet quality",
x = NULL, y = "% of adults") +
theme_minimal(base_size = 12) +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10, hjust = 1),
plot.title = element_text(face = "bold"))
dm_ctrl_age <- df_age %>%
dplyr::filter(dx_dm == 1L) %>%
dplyr::mutate(med_group = ifelse(dm_meds == 1L, "Medicated", "Unmedicated")) %>%
dplyr::group_by(age3, diet_ter, med_group) %>%
dplyr::summarise(
n_dx = dplyr::n(),
n_ctrl = sum(coalesce(dm_ctrl,0L), na.rm = TRUE),
pct_ctrl = 100 * n_ctrl / pmax(n_dx, 1),
.groups = "drop"
)
p_dm_ctrl_diet_age <- ggplot(dm_ctrl_age,
aes(x = diet_ter, y = pct_ctrl, fill = med_group)) +
geom_col(position = position_dodge(width = 0.75), width = 0.65) +
geom_text(aes(label = sprintf("%.0f%%\n(n=%s)", pct_ctrl, scales::comma(n_dx))),
position = position_dodge(width = 0.75),
vjust = -0.35, lineheight = 0.95, size = 3) +
scale_fill_brewer(palette = "Blues", direction = -1, name = NULL) +
facet_wrap(~ age3, nrow = 1) +
labs(title = "Diabetes control (A1c < 7%) among diagnosed by Diet quality × Age group",
x = "Diet quality (tertiles)", y = "% controlled") +
theme_minimal(base_size = 12) +
theme(panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10, hjust = 1),
plot.title = element_text(face = "bold"))
p_path_diet_age
p_dm_ctrl_diet_age
For summaries, I collapse the pathway plots into tables by group (PIR×diet or age×diet), computing—for each stage (diagnosed, on meds, controlled)—the percentage at each diet tertile and then taking simple high-minus-low diet differences, plus (for diabetes) the medicated vs unmedicated control gap within each stratum.
# 1) Hypertension & diabetes pathways by PIR × diet: High vs Low diet
path_pir_diet_diff <- path_diet_all %>%
filter(outcome %in% c("Hypertension","Diabetes"),
diet_ter %in% c("Low diet score","High diet score")) %>%
select(pir_grp, outcome, diet_ter, stage, pct) %>%
pivot_wider(names_from = diet_ter, values_from = pct) %>%
mutate(diff_hi_minus_lo = `High diet score` - `Low diet score`) %>%
arrange(outcome, pir_grp, stage)
path_pir_diet_diff
## # A tibble: 18 × 6
## pir_grp outcome stage `Low diet score` `High diet score` diff_hi_minus_lo
## <fct> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Low PIR Diabetes Cont… 33.3 42.9 9.52
## 2 Low PIR Diabetes On m… 84.9 89.6 4.66
## 3 Low PIR Diabetes Diag… 15.1 13.7 -1.42
## 4 Mid PIR Diabetes Cont… 44.8 45.5 0.663
## 5 Mid PIR Diabetes On m… 87.5 78.2 -9.32
## 6 Mid PIR Diabetes Diag… 13.5 9.35 -4.17
## 7 High PIR Diabetes Cont… 58.6 54.5 -4.03
## 8 High PIR Diabetes On m… 85.7 84.1 -1.62
## 9 High PIR Diabetes Diag… 11.4 5.42 -5.98
## 10 Low PIR Hypertens… Cont… 61.9 60.2 -1.66
## 11 Low PIR Hypertens… On m… 0 0 0
## 12 Low PIR Hypertens… Diag… 42.6 40.4 -2.18
## 13 Mid PIR Hypertens… Cont… 64.2 66.0 1.83
## 14 Mid PIR Hypertens… On m… 0 0 0
## 15 Mid PIR Hypertens… Diag… 42.2 34.1 -8.08
## 16 High PIR Hypertens… Cont… 74.7 74.4 -0.359
## 17 High PIR Hypertens… On m… 0 0 0
## 18 High PIR Hypertens… Diag… 36.0 28.4 -7.59
# 2) Diabetes control among diagnosed, by PIR × diet and med status
dm_ctrl_pir_diet_med <- dm_ctrl_by_diet_pir %>%
select(pir_grp, diet_ter, med_group, pct_ctrl) %>%
pivot_wider(names_from = med_group, values_from = pct_ctrl) %>%
mutate(diff_med_minus_unmed = Medicated - Unmedicated) %>%
arrange(pir_grp, diet_ter)
dm_ctrl_pir_diet_med
## # A tibble: 12 × 5
## pir_grp diet_ter Medicated Unmedicated diff_med_minus_unmed
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Low PIR Low diet score 30.4 50 -19.6
## 2 Low PIR Mid diet score 43.5 70.6 -27.1
## 3 Low PIR High diet score 39.1 75 -35.9
## 4 Low PIR <NA> 15.5 47.6 -32.2
## 5 Mid PIR Low diet score 39.3 83.3 -44.0
## 6 Mid PIR Mid diet score 49.2 75 -25.8
## 7 Mid PIR High diet score 37.2 75 -37.8
## 8 Mid PIR <NA> 25.4 21.1 4.32
## 9 High PIR Low diet score 56.7 70 -13.3
## 10 High PIR Mid diet score 38.5 85.7 -47.3
## 11 High PIR High diet score 48.6 85.7 -37.1
## 12 High PIR <NA> 12 69.2 -57.2
## ---- panelD_numbers_age_diet --------------------------------------------
# 3) Hypertension & diabetes pathways by Age × diet: High vs Low diet
path_age_diet_diff <- path_all_age %>%
filter(outcome %in% c("Hypertension","Diabetes"),
diet_ter %in% c("Low diet score","High diet score")) %>%
select(age3, outcome, diet_ter, stage, pct) %>%
pivot_wider(names_from = diet_ter, values_from = pct) %>%
mutate(diff_hi_minus_lo = `High diet score` - `Low diet score`) %>%
arrange(outcome, age3, stage)
path_age_diet_diff
## # A tibble: 18 × 6
## age3 outcome stage `Low diet score` `High diet score` diff_hi_minus_lo
## <fct> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Younger (1… Diabet… Cont… 33.3 60 26.7
## 2 Younger (1… Diabet… On m… 50 80 30
## 3 Younger (1… Diabet… Diag… 1.80 1.68 -0.124
## 4 Middle (35… Diabet… Cont… 47.1 31.1 -15.9
## 5 Middle (35… Diabet… On m… 86.8 86.7 -0.0980
## 6 Middle (35… Diabet… Diag… 18.1 11.0 -7.10
## 7 Older (65+) Diabet… Cont… 45.5 56.0 10.5
## 8 Older (65+) Diabet… On m… 88.1 86.2 -1.87
## 9 Older (65+) Diabet… Diag… 25.5 17.4 -8.15
## 10 Younger (1… Hypert… Cont… 69.2 80 10.8
## 11 Younger (1… Hypert… On m… 0 0 0
## 12 Younger (1… Hypert… Diag… 7.83 8.39 0.558
## 13 Middle (35… Hypert… Cont… 70.0 70.5 0.572
## 14 Middle (35… Hypert… On m… 0 0 0
## 15 Middle (35… Hypert… Diag… 40.3 29.5 -10.9
## 16 Older (65+) Hypert… Cont… 63.8 64.1 0.253
## 17 Older (65+) Hypert… On m… 0 0 0
## 18 Older (65+) Hypert… Diag… 63.2 53.7 -9.47
# 4) Diabetes control among diagnosed, by Age × diet and med status
dm_ctrl_age_diet_med <- dm_ctrl_age %>%
select(age3, diet_ter, med_group, pct_ctrl) %>%
pivot_wider(names_from = med_group, values_from = pct_ctrl) %>%
mutate(diff_med_minus_unmed = Medicated - Unmedicated) %>%
arrange(age3, diet_ter)
dm_ctrl_age_diet_med
## # A tibble: 9 × 5
## age3 diet_ter Medicated Unmedicated diff_med_minus_unmed
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 Younger (18–34) Low diet score 0 66.7 -66.7
## 2 Younger (18–34) Mid diet score 16.7 66.7 -50
## 3 Younger (18–34) High diet score 50 100 -50
## 4 Middle (35–64) Low diet score 42.4 77.8 -35.4
## 5 Middle (35–64) Mid diet score 44.2 80 -35.8
## 6 Middle (35–64) High diet score 24.4 75 -50.6
## 7 Older (65+) Low diet score 42.9 64.7 -21.8
## 8 Older (65+) Mid diet score 42.9 66.7 -23.8
## 9 Older (65+) High diet score 52.1 80 -27.9
Summary: Across PIR strata, people with high diet scores are consistently less likely to carry a diabetes diagnosis than low-diet peers at the same income level (≈4–6 percentage-point lower prevalence in mid and high PIR), while differences in control among those diagnosed are modest: in low PIR, high-diet diabetics are about 10 pp more likely to be controlled (43% vs 33%), but in mid and high PIR control is similar or slightly worse for the high-diet group. Age patterns echo this: diagnosis rates are much higher in middle and older adults, and within these bands high diet is associated with ~7–8 pp lower diabetes prevalence, whereas in young adults prevalence is ~2% regardless of diet but high-diet youth who are diagnosed show markedly higher medication use and control (≈60% vs 33% controlled, with small n). Finally, comparing medicated vs unmedicated patients within each diet × PIR or diet × age cell, control is always much lower among those on medication (often 20–60 pp), which almost certainly reflects confounding by indication (sicker patients are the ones treated) rather than a causal effect of medication. Taken together, Panel D supports the story that diet quality is more strongly linked to who ends up with diabetes in the first place than to how well diabetes is controlled once a person is in care, with income and age doing most of the work in shaping the control gradients.
Figure E. Ternary plots: diet composition vs A1c and SBP
Up to this point, we have treated “diet quality” as a single scalar score and asked how it interacts with income, age, and behavior load. That lens is useful for equity questions, but it hides what about diet might matter biologically: the balance between overall quality, total energy intake, and sodium load. Public health messaging often collapses these dimensions into a vague “good diet / bad diet” binary; here we explicitly separate them to see whether high-risk individuals cluster in particular corners of the diet–energy–sodium space.
To do this, I construct ternary scatterplots where each point represents an individual’s relative composition of diet quality, calories, and sodium:
Within each sex × age band (age3 × sex2), I convert the diet score, mean daily kcal, and mean daily sodium (averaged across the two 24-hr recalls) into percentile ranks,
Renormalize those three ranks so they sum to 1, giving ternary coordinates for diet, energy, and sodium share,
Then facet the ternary simplex by sex and age band.
In the A1c ternary plot, people with A1c ≥ 6% are colored by A1c and others are grey; in the SBP ternary plot, people in the top SBP decile (or flagged as uncontrolled) are colored by SBP and others are grey. These plots let us ask: among individuals at the highest metabolic or blood-pressure risk, do we see consistent shifts toward “high sodium / high energy” corners, or towards “high diet quality but still high energy/sodium,” beyond what a simple diet score captures?
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
row_mean <- function(data, cols){
if (!length(cols)) return(rep(NA_real_, nrow(data)))
mats <- lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]])))
rowMeans(as.data.frame(mats), na.rm = TRUE)
}
if (!("age3" %in% names(df1))) {
df1 <- df1 %>%
mutate(
RIDAGEYR = to_num(RIDAGEYR),
age3 = case_when(
!is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65 ~ "Older (65+)"
),
age3 = factor(age3, levels = c("Younger (18–34)","Middle (35–64)","Older (65+)"))
)
}
df1 <- df1 %>%
mutate(sex2 = case_when(
RIAGENDR %in% 1 ~ "Male",
RIAGENDR %in% 2 ~ "Female",
TRUE ~ NA_character_
)) %>%
mutate(sex2 = factor(sex2, levels=c("Male","Female")))
kcal_cols <- grep("^DR[12].*KCAL$", names(df1), value = TRUE)
sod_cols <- grep("^DR[12].*(SODI|SOD)$", names(df1), value = TRUE)
df1 <- df1 %>%
mutate(
kcal = row_mean(cur_data(), kcal_cols),
sodium = row_mean(cur_data(), sod_cols),
a1c = to_num(a1c)
)
library(dplyr)
plot_dat <- df1 %>%
filter(!is.na(sex2), !is.na(age3)) %>%
filter(rowSums(is.na(select(., diet_score, kcal, sodium))) < 3) %>%
group_by(sex2, age3) %>%
mutate(
r_diet = percent_rank(diet_score),
r_kcal = percent_rank(kcal),
r_sod = percent_rank(sodium)
) %>%
ungroup() %>%
mutate(
s = r_diet + r_kcal + r_sod,
t_diet = ifelse(s > 0, r_diet / s, NA_real_),
t_kcal = ifelse(s > 0, r_kcal / s, NA_real_),
t_sod = ifelse(s > 0, r_sod / s, NA_real_)
) %>%
filter(is.finite(t_diet), is.finite(t_kcal), is.finite(t_sod))
plot_low <- plot_dat %>% filter(is.na(a1c) | a1c < 6)
plot_high <- plot_dat %>% filter(!is.na(a1c) & a1c >= 6)
p_tern_a1c <- ggtern() +
geom_point(
data = plot_low,
aes(x = t_kcal, y = t_diet, z = t_sod),
color = "grey80", size = 1.2, alpha = 0.55, na.rm = TRUE
) +
geom_point(
data = plot_high,
aes(x = t_kcal, y = t_diet, z = t_sod, color = a1c),
size = 1.2, alpha = 0.75, na.rm = TRUE
) +
facet_grid(sex2 ~ age3) +
Tlab("Energy rank") + Llab("Diet score rank") + Rlab("Sodium rank") +
scale_color_viridis_c(option = "magma", direction = -1, name = "A1c (%)") +
theme_bw(base_size = 11) +
theme(
legend.position = "right",
strip.text = element_text(face = "bold")
) +
labs(
title = "Diet score vs Energy vs Sodium (rank-normalized)",
subtitle = "Six subgroups (Male/Female × age strata). A1c < 6 shown in grey; ≥6 gets purple scale",
caption = "Each axis is a within-subgroup percentile rank; components renormalized to sum to 1."
)
## Warning in geom_point(data = plot_low, aes(x = t_kcal, y = t_diet, z = t_sod),
## : Ignoring unknown aesthetics: z
## Warning in geom_point(data = plot_high, aes(x = t_kcal, y = t_diet, z = t_sod,
## : Ignoring unknown aesthetics: z
p_tern_a1c
## Ignoring unknown labels:
## • R : "Sodium rank"
## • Rarrow : "Sodium rank"
## • L : "Diet score rank"
## • Larrow : "Diet score rank"
## • T : "Energy rank"
## • Tarrow : "Energy rank"
12.2 SBP ternary plot: diet score vs energy vs sodium
For SBP, I reuse the same ternary coordinate system (diet score, energy, sodium ranks) but:
Rebuild plot_dat only if it does not already exist, to avoid recomputation.
Define “controlled” vs “uncontrolled” BP based on SBP/DBP ≥ 140/90.
Clip extreme SBP values for color scaling and plot uncontrolled individuals in color, with all others in grey.
if (!exists("plot_dat")) {
library(dplyr)
to_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
rank01 <- function(x) ifelse(is.na(x), NA_real_, (rank(x, na.last="keep", ties.method="average")-0.5)/sum(!is.na(x)))
renorm3 <- function(a,b,c){
s <- a+b+c
tibble(
t_diet = ifelse(s>0, a/s, NA_real_),
t_kcal = ifelse(s>0, b/s, NA_real_),
t_sod = ifelse(s>0, c/s, NA_real_)
)
}
find_cols <- function(patterns) {
cn <- names(df1)
keep <- unlist(lapply(patterns, function(p) grep(p, cn, ignore.case = TRUE, value = TRUE)))
unique(keep)
}
row_mean <- function(data, cols) {
if (!length(cols)) return(rep(NA_real_, nrow(data)))
m <- as.data.frame(lapply(cols, function(cn) suppressWarnings(as.numeric(data[[cn]]))))
rowMeans(m, na.rm = TRUE)
}
kcal_cols <- find_cols(c("^DR1TKCAL$","^DR2TKCAL$"))
sod_cols <- find_cols(c("^DR1TSODI$","^DR2TSODI$"))
stopifnot("diet_score" %in% names(df1))
df_tmp <- df1 %>%
mutate(
kcal = row_mean(cur_data(), kcal_cols),
sodium = row_mean(cur_data(), sod_cols),
sex2 = case_when(!!as.name("RIAGENDR") %in% 1 ~ "Male",
!!as.name("RIAGENDR") %in% 2 ~ "Female",
TRUE ~ "Unknown"),
age3 = case_when(!is.na(RIDAGEYR) & RIDAGEYR >= 18 & RIDAGEYR < 35 ~ "Younger (18–34)",
!is.na(RIDAGEYR) & RIDAGEYR >= 35 & RIDAGEYR < 65 ~ "Middle (35–64)",
!is.na(RIDAGEYR) & RIDAGEYR >= 65 ~ "Older (65+)",
TRUE ~ NA_character_),
sex2 = factor(sex2, levels=c("Male","Female")),
age3 = factor(age3, levels=c("Younger (18–34)","Middle (35–64)","Older (65+)"))
) %>%
filter(!is.na(sex2), !is.na(age3))
plot_dat <- df_tmp %>%
group_by(sex2, age3) %>%
mutate(
r_diet = rank01(diet_score),
r_kcal = rank01(kcal),
r_sod = rank01(sodium)
) %>%
ungroup() %>%
bind_cols(renorm3(.$r_diet, .$r_kcal, .$r_sod)) %>%
mutate(
sbp = to_num(sbp),
dbp = to_num(dbp)
)
}
bp_low <- plot_dat %>%
filter(is.na(sbp) | is.na(dbp) | (sbp < 140 & dbp < 90))
bp_high <- plot_dat %>%
filter(!is.na(sbp) & !is.na(dbp) & (sbp >= 140 | dbp >= 90)) %>%
mutate(sbp_clip = pmax(pmin(sbp, 200), 90))
p_tern_bp <- ggtern() +
geom_point(
data = bp_low,
aes(x = t_kcal, y = t_diet, z = t_sod),
color = "grey80", size = 1.2, alpha = 0.55, na.rm = TRUE
) +
geom_point(
data = bp_high,
aes(x = t_kcal, y = t_diet, z = t_sod, color = sbp_clip),
size = 1.2, alpha = 0.8, na.rm = TRUE
) +
facet_grid(sex2 ~ age3) +
Tlab("Energy rank") + Llab("Diet score rank") + Rlab("Sodium rank") +
scale_color_viridis_c(option = "magma", direction = -1, name = "SBP (mmHg)") +
theme_bw(base_size = 11) +
theme(
legend.position = "right",
strip.text = element_text(face = "bold")
) +
labs(
title = "Diet score vs Energy vs Sodium (rank-normalized)",
subtitle = "Six subgroups (Male/Female × age strata). Grey = SBP<140 & DBP<90; colored = uncontrolled",
caption = "Each axis is a within-subgroup percentile rank; components renormalized to sum to 1."
)
## Warning in geom_point(data = bp_low, aes(x = t_kcal, y = t_diet, z = t_sod), :
## Ignoring unknown aesthetics: z
## Warning in geom_point(data = bp_high, aes(x = t_kcal, y = t_diet, z = t_sod, :
## Ignoring unknown aesthetics: z
p_tern_bp
## Ignoring unknown labels:
## • R : "Sodium rank"
## • Rarrow : "Sodium rank"
## • L : "Diet score rank"
## • Larrow : "Diet score rank"
## • T : "Energy rank"
## • Tarrow : "Energy rank"
Numerical Analysis: For each outcome (A1c, SBP), I identify the
top 5% of values and compare their average ternary composition (mean
diet/energy/sodium ranks) with the remaining 90%, overall and optionally
within age or income strata.
# --- Build ternary coords: diet score, kcal, sodium -----------------
diet_ternary <- df1 %>%
mutate(
# average across recall days (change SODI names if yours differ)
kcal_mean = rowMeans(across(c(DR1TKCAL, DR2TKCAL)), na.rm = TRUE),
sodium_mean = rowMeans(across(c(DR1TSODI, DR2TSODI)), na.rm = TRUE)
) %>%
group_by(sex2, age3) %>%
mutate(
r_diet = percent_rank(diet_score),
r_kcal = percent_rank(kcal_mean),
r_sodium = percent_rank(sodium_mean),
r_sum = r_diet + r_kcal + r_sodium,
tern_diet = r_diet / r_sum,
tern_kcal = r_kcal / r_sum,
tern_sodium = r_sodium / r_sum
) %>%
ungroup()
# --- A1c: top 10% vs others ----------------------------------------
a1c_cut <- quantile(diet_ternary$a1c, 0.95, na.rm = TRUE)
a1c_comp <- diet_ternary %>%
filter(!is.na(a1c)) %>%
mutate(top_a1c = a1c >= a1c_cut) %>%
group_by(top_a1c) %>%
summarise(
n = n(),
mean_diet = mean(tern_diet, na.rm = TRUE),
mean_kcal = mean(tern_kcal, na.rm = TRUE),
mean_sodium= mean(tern_sodium, na.rm = TRUE)
)
a1c_comp
## # A tibble: 2 × 5
## top_a1c n mean_diet mean_kcal mean_sodium
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 6372 0.360 0.319 0.321
## 2 TRUE 343 0.376 0.292 0.332
# --- SBP: top 10% vs others ----------------------------------------
sbp_cut <- quantile(diet_ternary$sbp, 0.95, na.rm = TRUE)
sbp_comp <- diet_ternary %>%
filter(!is.na(sbp)) %>%
mutate(top_sbp = sbp >= sbp_cut) %>%
group_by(top_sbp) %>%
summarise(
n = n(),
mean_diet = mean(tern_diet, na.rm = TRUE),
mean_kcal = mean(tern_kcal, na.rm = TRUE),
mean_sodium= mean(tern_sodium, na.rm = TRUE)
)
sbp_comp
## # A tibble: 2 × 5
## top_sbp n mean_diet mean_kcal mean_sodium
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 7138 0.353 0.321 0.326
## 2 TRUE 380 0.403 0.305 0.292
The ternary summaries suggest subtle but real composition shifts in the high-risk tails. For A1c, the top 5% (n = 343) tilt slightly toward higher diet-score share (0.38 vs 0.36) and lower energy share (0.29 vs 0.32), with sodium nearly unchanged (0.33 vs 0.32). For SBP, the top 10% (n = 380) show a clearer move toward diet quality (0.40 vs 0.35) and away from sodium (0.29 vs 0.33), with a small drop in energy (0.31 vs 0.32). These shifts are modest in magnitude, but they hint that risk may concentrate in particular diet–energy–sodium mixes rather than in diet score alone—patterns that would need finer stratification and modeling to characterize more definitively.
Conclusion - Across tiles and ridges, income and age drive
the steepest gradients in diagnosis and uncontrolled BP; diet tertiles
mostly add small adjustments once we condition on these factors.
- Within BLI, PIR, and age strata, moving from low → high diet quality
only modestly shifts A1c and SBP, and often tracks more with care
exposure (access, blood testing, medication use) than with clearly lower
underlying risk.
- Pathway plots suggest that better diet is associated with somewhat
higher diabetes control, especially among older and
higher-income adults on medication, but these gains are modest
and unevenly distributed.
- Taken together, the project suggests that “diet quality” in this
dataset is entangled with structural factors (income, age,
access, treatment) rather than acting as a clean standalone
lever, pointing future work toward disentangling behavior from access
and testing policy- or clinic-level interventions, not just individual
diet change.
Analytic limitations:
Cross-sectional and descriptive: NHANES is used here in a cross-sectional way, so we can describe patterns but cannot infer causal effects of diet on BP or A1c, or the direction of these relationships.
Survey design and representativeness: For simplicity we treated the data as a simple random sample and did not apply NHANES survey weights or design corrections, so absolute prevalences and differences are not nationally representative and have unknown sampling uncertainty.
Crude constructs and measurement error:
No multivariable adjustment: All tiles, ridges, and pathways are unadjusted summaries; apparent diet gradients may partly reflect unmeasured confounding (insurance, comorbidities, race/ethnicity, clinic access). More formal models would be needed to isolate effects conditional on these factors.
Despite these constraints, the combination of structured heatmaps, distributional ridges, and pathway plots still surfaces clear, interpretable patterns—showing how diet tends to co-vary with income, age, and behavior load in ways that shape diagnosis and control. As a scaffold, this analysis sets up a concrete, reproducible framework that future work can extend with proper survey weighting, richer diet metrics, and fully adjusted models.